
feat: add HTTP event recorder reporting #97

Merged · 18 commits · Jul 27, 2022

Conversation

@rvagg (Collaborator) commented Jul 12, 2022

Closes: #96

I thought I'd have a go at this; I think we discussed dividing it up this way since @kylehuntsman is so keen to do filecoin-project/tornado-deploy#2 (and I'm not). This adds an HTTP PUT to https://foo.bar/retrieval-event/~uuid~/providers/~minerid for both successful and unsuccessful retrievals, using the data I could collect at that point that seemed reasonable to want. We discussed (and I think agreed on?) flattening it to just the one endpoint and letting the server handle aggregation of UUIDs. The URL string before /retrieval-event... comes from the config, via a new, optional field, event-recorder-endpoint-url; if it's not set, the whole thing is a noop.

See the event struct for what's included in the message. One thing I couldn't add is "stage", because I couldn't figure out what that might refer to!
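For illustration, here's a minimal sketch of what such a reporting call could look like; the postEvent helper and retrievalEvent type are illustrative names, not the actual code in this PR:

package eventrecorder

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
)

// retrievalEvent is a placeholder for the real event struct in this PR.
type retrievalEvent struct {
    Cid          string `json:"cid"`
    Success      bool   `json:"success"`
    ErrorMessage string `json:"errorMessage,omitempty"`
}

// postEvent PUTs a JSON event body to
// <event-recorder-endpoint-url>/retrieval-event/<uuid>/providers/<minerid>.
func postEvent(ctx context.Context, endpointURL, retrievalID, minerID string, evt retrievalEvent) error {
    body, err := json.Marshal(evt)
    if err != nil {
        return err
    }
    url := fmt.Sprintf("%s/retrieval-event/%s/providers/%s", endpointURL, retrievalID, minerID)
    req, err := http.NewRequestWithContext(ctx, http.MethodPut, url, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode/100 != 2 {
        return fmt.Errorf("event recorder responded with %s", resp.Status)
    }
    return nil
}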

@kylehuntsman (Collaborator) commented Jul 12, 2022

After discussing in a little more detail, we came up with a flat ProviderRetrieval model that adds and removes some of the properties from the original proposal here. This is not a finalized model and can be iterated on!

ProviderRetrieval
{
   providerPeerID: peerID
   retrievalId: UUID
   cid: CID
   autoretrieveInstanceId: string
   lastStage: string
   errorMessage: string
   startedAt: datetime
   endedAt: datetime
   sizeBytes: bytes
   completed: boolean
}
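
As a rough Go rendering of that model (field types, JSON names, and the github.com/google/uuid dependency are my guesses, not a finalized schema):

package eventrecorder

import (
    "time"

    "github.com/google/uuid"
)

// ProviderRetrieval sketches the proposed flat model above; one record per
// provider + retrieval attempt, grouped by RetrievalID.
type ProviderRetrieval struct {
    ProviderPeerID         string    `json:"providerPeerID"`
    RetrievalID            uuid.UUID `json:"retrievalId"`
    Cid                    string    `json:"cid"`
    AutoretrieveInstanceID string    `json:"autoretrieveInstanceId"`
    LastStage              string    `json:"lastStage"`
    ErrorMessage           string    `json:"errorMessage,omitempty"`
    StartedAt              time.Time `json:"startedAt"`
    EndedAt                time.Time `json:"endedAt"`
    SizeBytes              uint64    `json:"sizeBytes"`
    Completed              bool      `json:"completed"`
}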

@codecov-commenter commented Jul 15, 2022

Codecov Report

Merging #97 (f94ff73) into master (59ea9dd) will increase coverage by 4.41%.
The diff coverage is 29.60%.

@@            Coverage Diff             @@
##           master      #97      +/-   ##
==========================================
+ Coverage    7.80%   12.22%   +4.41%     
==========================================
  Files          12       16       +4     
  Lines        1767     2102     +335     
==========================================
+ Hits          138      257     +119     
- Misses       1626     1831     +205     
- Partials        3       14      +11     
Impacted Files Coverage Δ
autoretrieve.go 0.00% <0.00%> (ø)
bitswap/provider.go 0.00% <0.00%> (ø)
config.go 0.00% <0.00%> (ø)
...ecoin/eventrecorder/filelogserver/filelogserver.go 0.00% <0.00%> (ø)
filecoin/retriever.go 0.00% <0.00%> (ø)
filecoin/retrieverevents.go 0.00% <0.00%> (ø)
filecoin/activeretrievals.go 69.84% <69.84%> (ø)
filecoin/eventrecorder/eventrecorder.go 72.09% <72.09%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@rvagg (Collaborator, Author) commented Jul 15, 2022

Currently depends on application-research/filclient#86 for the filclient dependency.

@rvagg force-pushed the rvagg/eventrecorder branch 3 times, most recently from bcf6f37 to b500340, on July 18, 2022 at 10:15.
@rvagg (Collaborator, Author) commented Jul 18, 2022

I'd like to discuss what we're actually reporting since we have both the query-ask phase and the actual retrieval phase. Also, there's a lot of disparate info to be collected through these events and I'm not really sure a flat endpoint makes all that much sense. To illustrate, I've broken the (possible) reporting into an interface where you can see what we could report at each point: https://github.com/application-research/autoretrieve/pull/97/files#diff-de7628819e4344a18be7e7f90c87c7c502d5bee26a94a4eb1cce9031ca1646c2R15-R57

Note there's already some overlap there: the cid keeps getting repeated, when maybe we should only include it with the start event(s).

Currently I'm bundling this all into a single message type, with lots of optional (empty) fields depending on the event. In IPLD schema terms that's here: https://github.com/application-research/autoretrieve/pull/97/files#diff-15ae65e0803959b92bfa25397445c590a3f793379e1129a7e23252766f0efcd9R37-R69

But then there's also the URL. I've started with the proposed scheme but varied it to add the query-ask phase, so we have these three variations:

  • https://.../query-event/~uuid~ (start of query-ask, we haven't asked anyone yet but we want a timestamp)
  • https://.../query-event/~uuid~/providers/~peerID~ (during query-ask, in parallel to all of the candidates)
  • https://.../retrieval-event/~uuid~/providers/~peerID~ (during retrieval to each candidate until we get a success)

(The code is still not really ready for review; I'm dealing with nasty race problems trying to match filclient's event reporting with the data we have available here.)
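
To make those three variants concrete, a small sketch; the helper names are mine, not the ones in the PR, and the IDs are passed as plain strings:

package eventrecorder

import "fmt"

// queryStartURL reports the start of the query-ask phase for a retrieval.
func queryStartURL(base, retrievalID string) string {
    return fmt.Sprintf("%s/query-event/%s", base, retrievalID)
}

// queryProviderURL reports a query-ask to a single candidate provider.
func queryProviderURL(base, retrievalID, providerID string) string {
    return fmt.Sprintf("%s/query-event/%s/providers/%s", base, retrievalID, providerID)
}

// retrievalProviderURL reports a retrieval attempt against a single provider.
func retrievalProviderURL(base, retrievalID, providerID string) string {
    return fmt.Sprintf("%s/retrieval-event/%s/providers/%s", base, retrievalID, providerID)
}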

@hannahhoward (Collaborator) left a comment

My overall comment is that the schema here is too complicated; we should just use the one @kylehuntsman proposed in comments (mirroring #96), which he is presumably building his side toward. Most of the need for separate interfaces for the query/retrieval endpoints arises from trying to record all this data, which I don't think we need. I also see data that's relevant to the overall autoretrieve attempt but that we really don't need (like candidateCount) -- again, our unit here is a single provider + cid attempt, and everything else is tied together through the UUID. If we want to record information specific to the overall autoretrieve attempt, we can make a separate endpoint later.

I don't want to reduce my feedback to "let's follow the brief", but I think that's what's needed for coordinating tickets across an API boundary. If not, we should have a sync discussion to agree on changes to @kylehuntsman's proposed API (based on #96), because I don't see us getting compatibility otherwise.

Re: filclient race conditions: is there a reason we can't just use filclient events entirely and drop the calls to record things inside the retriever code (other than perhaps initialization)? We should probably use RetrievalEventSuccess/RetrievalEventFailure instead of dropping them, no? I guess we still need the code for the "Start" event, but I think we'll have better luck with race conditions by staying in one system rather than mixing and matching.

I believe the only condition we have to address specifically within the autoretrieve code is when a query was made but no retrieval was attempted because another peer succeeded first. I think we should be able to handle that by setting the ProviderRetrieval for all candidates we skipped to some kind of "did not attempt retrieval" status at the end of retrieveFromBestCandidate; a rough sketch of that idea follows below.
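
A rough sketch of that suggestion; every name here, including stageRecorder and RecordStage, is hypothetical rather than code from this PR:

package filecoin

// stageRecorder is a stand-in for whatever records ProviderRetrieval updates.
type stageRecorder interface {
    RecordStage(retrievalID, providerID, stage string)
}

const stageNotAttempted = "retrieval-not-attempted"

// recordSkippedCandidates would run at the end of retrieveFromBestCandidate,
// marking every queried candidate that was never actually retrieved from.
func recordSkippedCandidates(rec stageRecorder, retrievalID string, queried, attempted map[string]bool) {
    for providerID := range queried {
        if !attempted[providerID] {
            rec.RecordStage(retrievalID, providerID, stageNotAttempted)
        }
    }
}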

The following inline review comments refer to this code from the diff:

func (em *eventManager) fireEvent(cb func(timestamp time.Time, listener RetrievalEventListener)) {
    timestamp := time.Now()
    go func() {
        em.lk.RLock()
Collaborator comment:

This feels quite racy, because messages may publish out of order.

I recommend a single additional goroutine that reads messages over a channel and publishes them. You may need to separate the read-from-channel step from the publish step to avoid blocking the channel. This is a good reference: https://github.com/filecoin-project/go-statemachine/blob/master/fsm/fsm.go#L149-L173, or you can also use a sync.Cond, which I find is a good concurrency primitive for queues. A minimal sketch of the idea is below.
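
A minimal sketch of that suggestion, using a sync.Cond-guarded queue drained by a single publishing goroutine; the names are illustrative, not the eventManager code in this PR:

package filecoin

import "sync"

type eventQueue struct {
    lk      sync.Mutex
    cond    *sync.Cond
    pending []func() // queued publish calls, kept in fire order
    closed  bool
}

func newEventQueue() *eventQueue {
    q := &eventQueue{}
    q.cond = sync.NewCond(&q.lk)
    go q.run()
    return q
}

// Enqueue would be called where fireEvent currently spawns a goroutine.
func (q *eventQueue) Enqueue(publish func()) {
    q.lk.Lock()
    q.pending = append(q.pending, publish)
    q.lk.Unlock()
    q.cond.Signal()
}

// run is the single publisher goroutine: it pops events in order and calls
// publish outside the lock so enqueuers are never blocked.
func (q *eventQueue) run() {
    for {
        q.lk.Lock()
        for len(q.pending) == 0 && !q.closed {
            q.cond.Wait()
        }
        if q.closed && len(q.pending) == 0 {
            q.lk.Unlock()
            return
        }
        next := q.pending[0]
        q.pending = q.pending[1:]
        q.lk.Unlock()
        next()
    }
}

func (q *eventQueue) Close() {
    q.lk.Lock()
    q.closed = true
    q.lk.Unlock()
    q.cond.Broadcast()
}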

Collaborator comment:

Also, unless listeners get added/removed frequently, I don't think it's a problem to just call the callback inside the RLock, because multiple readers can still read at once.

@rvagg (Collaborator, Author) commented Jul 18, 2022

@hannahhoward this is still a draft; I'm not proposing a final form here, just wanting discussion on the options available. I think a sync discussion is probably in order:

  • Here I'm currently reporting the data that's available; I know it's well beyond the original ask, but I want to illustrate what we could be collecting and using for analysis. Having collected a lot of this myself, I'm seeing a lot of utility in some of the additional data we could be adding. I can pare it back pretty easily from what's here now; I just want to illustrate the possibilities.
  • I can't use the filclient events for the full flow of this; there isn't enough information in them, e.g. the size on a success or the error on a failure. Trying to marry the async events with the sync retrieval calls overcomplicates things.
  • The two phases of retrieval, (1) query and (2) actual retrieval, add a lot of additional complication: we have a flurry of query events to multiple SPs in parallel and then serial retrievals after that. The async event reporting from filclient becomes a mess because we have overlapping events that are hard to disambiguate, and we really need to be clear about what we're actually tracking and reporting in order to figure out which bits to include and which to leave out. Either way we have to deal (in here, and in the reporting db if we report it) with multiple connect and success/fail events for a single SP, possibly multiple per full retrieval flow.

@rvagg force-pushed the rvagg/eventrecorder branch 2 times, most recently from f8d4dc2 to 63a826c, on July 20, 2022 at 12:45.
@rvagg (Collaborator, Author) commented Jul 20, 2022

Still not quite done and ready for proper review, but it's working and reporting as we want it, I believe.

  • Still relies on filclient#87 ("feat: add EventType to retrieval events: query & retrieve"), where there's even more work now (see the notes in there).
  • The most interesting work is in filecoin/retrieval.go for now, although there's some additional debugging stuff that I intended to remove (see the runningRetrievals map), though it's been quite useful for chasing down leaks, so I'm now wondering about keeping it.
  • The event reporting now relies almost entirely on the events coming out of filclient, rather than on return values and errors from the calls to filclient; "almost" because there's one failure case that's not covered, timeouts, for which I've opted to locally simulate a failure event.
  • The concurrent-requests-to-miner accounting uses the same infrastructure for event reporting, since that's where all the useful data is.
  • The actual event posting code needs some more work before this is ready; maybe it should switch to a channel for posting so we don't risk ending up with a large pool of goroutines trying to post if we have a backlog for whatever reason. But for now it works, and that's the important thing.
  • The report data now matches what we've been discussing in filecoin-project/tornado-deploy#2 ("Build a simple database + Rest API for autoretrieve recording") (it should, at least).

Tbh, most of the time I've spent on this today, and a bit of yesterday, has been chasing down leaks caused by incomplete event coverage between filclient and here. I've worked out some methods for tracking those using both logs and some new debug reporting in here, and I'm fairly confident that the coverage is spot on now.

I have an outstanding concern about the query calls, which we don't apply a timeout to (at least, I'm not seeing one if it's there). So far I'm not seeing it as a cause of leaks, though; perhaps libp2p is giving us an implicit socket timeout. But we may want to consider addressing that: the current code here has a commented-out block that uses a deadline to impose the timeout (which is wired up through filclient already too); a rough sketch of the same idea follows below. If we care about quick retrieval we should be fairly ruthless on query timeouts, but if we care about data collection then some lenience might be in order too! Hopefully our data collection shows us query times nicely.
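
A rough sketch of how that query timeout could be applied, mirroring the RetrievalTimeout handling quoted further down; the helper and its queryTimeout parameter are hypothetical, not existing config:

package filecoin

import (
    "context"
    "time"
)

// withQueryTimeout wraps ctx with a deadline for a single query-ask call;
// the returned context would be passed to the filclient query and the cancel
// func deferred by the caller. A zero timeout leaves ctx untouched.
func withQueryTimeout(ctx context.Context, queryTimeout time.Duration) (context.Context, context.CancelFunc) {
    if queryTimeout == 0 {
        return ctx, func() {}
    }
    return context.WithTimeout(ctx, queryTimeout)
}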

Lastly, I have an example log output for this, using the filelogserver.go in this PR to generate a line-delimited JSON file containing the events sent to it: https://gist.github.com/rvagg/a9b941f1b423679659812fff1b9bf0a1

Comment on lines 440 to 445
minerCfgs := retriever.config.MinerConfigs[candidate.MinerPeer.ID]
if minerCfgs.RetrievalTimeout != 0 {
    // apply the per-miner retrieval timeout as a context deadline
    var cancelFunc func()
    ctx, cancelFunc = context.WithDeadline(ctx, time.Now().Add(minerCfgs.RetrievalTimeout))
    defer cancelFunc()
}
Author comment:

Maybe something for a separate PR, but this seems like an oversight.

@hannahhoward (Collaborator) left a comment

Seems like a lot of code in the activeretrievalmanager, but maybe it makes sense for me to wait and see the cleanup.

Overall, all is well with the API, which is what matters most.

filecoin/eventrecorder/eventrecorder.go (review thread: outdated, resolved)
filecoin/retriever.go (review thread: outdated, resolved)
filecoin/retriever.go (review thread: outdated, resolved)
@rvagg rvagg marked this pull request as ready for review July 21, 2022 06:15
@rvagg (Collaborator, Author) commented Jul 21, 2022

Cleaned up, simplified a little, and added more tests. The only remaining thing I wouldn't mind doing is linking it to a filclient branch; this currently doesn't compile in CI and requires a replace in go.mod pointing at my fork (see application-research/filclient#87). If/when I get write access I'll push a branch there and update here so it'll work properly.
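
For anyone building this branch locally in the meantime, the temporary replace in go.mod looks roughly like the following; the fork module path and pseudo-version are placeholders, not the exact values used:

// go.mod (temporary, local-only); substitute a pseudo-version generated from
// the actual fork commit for the placeholder below.
replace github.com/application-research/filclient => github.com/rvagg/filclient v0.0.0-20220721000000-000000000000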

Event log output should be the same as yesterday so my gist sample should still apply: https://gist.github.com/rvagg/a9b941f1b423679659812fff1b9bf0a1

@rvagg (Collaborator, Author) commented Jul 21, 2022

This now references the branch from application-research/filclient#88 in go.mod, so it's compiling and testing in CI here and shouldn't have all of the error messages inline, which should make it easier to review.

@hannahhoward (Collaborator) left a comment

Mostly LGTM, but I am worried about the noted potential memory leak. Can you fix or note why my analysis is wrong?

filecoin/activeretrievals.go (review thread: outdated, resolved)
filecoin/activeretrievals.go (review thread: resolved)
filecoin/eventrecorder/eventrecorder.go (review thread: outdated, resolved)
filecoin/retrieverevents.go (review thread: resolved)

Successfully merging this pull request may close these issues: Autoretrieve Publish API.

4 participants