Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V2 Control Protocol #29

Merged
merged 19 commits into from
May 24, 2022
Merged

V2 Control Protocol #29

merged 19 commits into from
May 24, 2022

Conversation

blakerouse
Copy link
Contributor

@blakerouse blakerouse commented Mar 15, 2022

V2 Control Protocol

This is a proposal on the new design of the control protocol.

Features

  • A unique state/config per unit (multiple inputs and output per process).
  • A similar actions protocol (only added 1 extra field).
  • A key-value storage scoped per unit for persistent storage.
  • Access to the artifacts API through the control protocol.
  • Ability to report name, version and metadata about the program to Elastic Agent.
  • Ability to write log messages from processes that are not started as sub-processes.

Backward Compatibility

This adds a new CheckinV2 that is to be used when the Elastic Agent is operating in V2 mode. The original Checkin will be deprecated at some point in the future (once all tools have migrated off).

Closes elastic/elastic-agent#215

@blakerouse blakerouse added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Mar 15, 2022
@blakerouse blakerouse self-assigned this Mar 15, 2022
@blakerouse
Copy link
Contributor Author

We need to decide on how we want to merge this into the repository. I was thinking about making it github.com/elastic/elastic-agent-client/v2 keeping the current client as github.com/elastic/elastic-agent-client.

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 15, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-05-19T13:38:20.114+0000

  • Duration: 3 min 37 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@ph
Copy link
Contributor

ph commented Mar 15, 2022

@blakerouse I think we should work the same as our other project, something like:

  • create a 1.x branch.
  • Tag the current code for v1.0.0 ?
  • Update the project to use that version.
  • Use main for v2?

@cmacknz
Copy link
Member

cmacknz commented Mar 15, 2022

Use main for v2?

This will work but I've usually seen public proto packages versioned like Go packages, where v1 and v2 coexist together in the tree until one obsoletes the other. Like Go modules, this will make the import path change between versions. You could use package proto; and package proto.v2; for example and do the same on the generated Go modules.

Here is Google Cloud Storage using this approach: https://github.com/googleapis/googleapis/tree/master/google/storage

@ph
Copy link
Contributor

ph commented Mar 15, 2022

@cmacknz good point, but at the same time we have a limited set of users, so I am not sure having a v2 worth it. I believe this would reduce our flexibility to make changes in repo/testing and so on.

@blakerouse
Copy link
Contributor Author

I am +1 for using main for v2.

@cmacknz
Copy link
Member

cmacknz commented Mar 15, 2022

@cmacknz good point, but at the same time we have a limited set of users, so I am not sure having a v2 worth it. I believe this would reduce our flexibility to make changes in repo/testing and so on.

Fair point, go with whatever makes development easier.

elastic-agent-client.proto Outdated Show resolved Hide resolved
@blakerouse
Copy link
Contributor Author

@jlind23 Had a good idea in our 1-1 to instead of making a whole v2, that we instead try to make this work with the existing v1 instead of a hard breaking change.

He had some good points of allowing things like Endpoint Security and APM server to migrate slower to v2. I think this is a valid point and something that we could do in this protocol.

The change only adds 1 field to Actions stream so we can easily add that and pass it to current processes and they will ignore it (which is going to be okay, until they support the v2 protocol anyway).

The new additional functions are not streaming protocols that are even required for processes to even use, so if they don't use them that is fine and not breaking.

The largest change is the Checkin stream, because of how state and configuration is changed. The idea I have is to keep the Checkin stream as it is today and then add a CheckinV2 stream. A newer process will connect to V2 and an older one that has not migrated will continue to connect to old Checkin. We then deprecate the old Checkin and we give a EOL for that Checkin in some future 8.x version.

elastic-agent-client.proto Outdated Show resolved Hide resolved
@ph
Copy link
Contributor

ph commented Mar 15, 2022

@blakerouse The v1,v2 I am worried to run in a hybrid mode, wouldn't that bring more complexity while debugging?

@jlind23
Copy link
Contributor

jlind23 commented Mar 15, 2022

@ph we will be aware when agent will be running in V2 mode, we just have to properly log it then.

@blakerouse
Copy link
Contributor Author

@ph Really we have 2 different ways we could implement it.

  1. Run in v2 mode explicitly, so even though the Checkin function is present, we do not allow connection to it over the GRPC. Elastic Agent will error to the client that its in v2 mode and must connect over CheckinV2.
  2. Run in hybrid mode that allows a v1 client but it translates to a v2 operating mode. So Elastic Agent runs in v2 mode, it computes the multiple units per process and builds the runtime environment. Then that v2 runtime environment is then merged to send a single configure to a v1 client, where the status is duplicated across the v2 runtime. This gives a temporary translation layer for v1 clients.

The open question is really do we want to support existing v1 clients when operating as v2, if so #2. Or we don't need too and #1 could be the answer. Or option #3, start with #1 and if we see an issue that requires us to do #2 then we implement the translation layer.

elastic-agent-client.proto Show resolved Hide resolved
elastic-agent-client.proto Show resolved Hide resolved
elastic-agent-client.proto Outdated Show resolved Hide resolved
elastic-agent-client.proto Outdated Show resolved Hide resolved
elastic-agent-client.proto Outdated Show resolved Hide resolved
elastic-agent-client.proto Outdated Show resolved Hide resolved
elastic-agent-client.proto Outdated Show resolved Hide resolved
elastic-agent-client.proto Outdated Show resolved Hide resolved
//
// Transactional store is provided to allow multiple key operations to occur before a commit to ensure consistent
// state when multiple keys make up the state of an units persistent state.
rpc StoreBeginTxn(StoreBeginTxnRequest) returns (StoreBeginTxnResponse);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question here: must all the store methods be called within a transaction, or could the Get|Set|Delete be called alone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Must be in a transaction.

elastic-agent-client.proto Outdated Show resolved Hide resolved
elastic-agent-client.proto Outdated Show resolved Hide resolved
elastic-agent-client.proto Outdated Show resolved Hide resolved
blakerouse and others added 3 commits March 17, 2022 09:08
Co-authored-by: Anderson Queiroz <contato@andersonq.eti.br>
Co-authored-by: Anderson Queiroz <contato@andersonq.eti.br>
Co-authored-by: Anderson Queiroz <contato@andersonq.eti.br>
@ph
Copy link
Contributor

ph commented Mar 17, 2022

@blakerouse I think what you are proposing in CheckV2, is the appropriate path. 1 to see if we need to support 2. I am considering that most processes would be fine with the new protocol because they are beats. The endpoint is a bit different, but and they do not need to adopt the Artifact service right away.

Or option 3, start with 1 and if we see an issue that requires us to do 2 then we implement the translation layer.

//
// Transactional store is provided to allow multiple key operations to occur before a commit to ensure consistent
// state when multiple keys make up the state of an units persistent state.
rpc BeginTxn(StoreBeginTxnRequest) returns (StoreBeginTxnResponse);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe rename BeginTxn -> BeginTx and other Txns to Txs?
similar to postgres go library for example BeginTx
https://github.com/lib/pq/blob/1ef134dc0e0dcdf8097ca347580c15df72ef786e/conn_go18.go#L58
or mysql
https://github.com/go-sql-driver/mysql/blob/bcc459a906419e2890a50fc2c99ea6dd927a88f2/connection.go#L471

"Tx" is more common abbreviation for transactions in other db libaries

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any handling for example for the broken connection after the transaction was started and left uncommitted? @blakerouse

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aleksmaus I would expect the method call to return an error when the method is called?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

depending on the storage backend/library and transaction implementation details, not closing the transaction could leave the transaction locks on the database/tables/rows. should probably be coded to have transaction in the scope of the connection or something and the connection is broken the transaction should be auto rolled back. thinking if these kind of cases. don't completely know if this is relevant for our particular implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aleksmaus the state of the connection will be tracked on the server-side (aka. Elastic Agent), when the connection is lost then the transaction will be discarded server-side. At that point that transaction is broken and they client will need to handle the error and create a new transaction

I also think Elastic Agent should place a hard timeout on transactions, as it would not be good to keep them open for a long period of time. They really should be short in time.

I will go through and adjust it all from "Txn" abbr. to "Tx".

cfgLock sync.RWMutex
obsLock sync.RWMutex

kickChan chan struct{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe rename kickChan ->kickCh etc for other chan vars. This naming convention seems to be more common in https://github.com/golang/go

kickChan chan struct{}
errChan chan error
unitsChan chan UnitChanged
unitsLock sync.RWMutex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe rename unitsLock -> unitsMu. This naming convention seems to be more common in https://github.com/golang/go

//
// Transactional store is provided to allow multiple key operations to occur before a commit to ensure consistent
// state when multiple keys make up the state of an units persistent state.
rpc BeginTxn(StoreBeginTxnRequest) returns (StoreBeginTxnResponse);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any handling for example for the broken connection after the transaction was started and left uncommitted? @blakerouse

@ph ph requested review from ph and removed request for lykkin May 18, 2022 13:00
// Response from `SetKey`.
message StoreSetKeyResponse {
// Empty at the moment, defined for possibility of adding future return values.
}
Copy link
Contributor

@ph ph May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I see a TTL maybe or to indicate single-use key.

//
// `txn_id` must be an ID of a transaction that was started with `READ_WRITE`.
//
// Does not error in the case that a key does not exist.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, align with Go's expected behavior.

@pzl
Copy link
Member

pzl commented May 18, 2022

@blakerouse @ph

What's the procedure look like for additions to this protocol in the future? If the changes are purely additive to the control protocol, can things be added to this spec in, say, 8.4 timeline?

We are still looking to add a new feature/API to Fleet Server for our next use case, but we don't have a full picture of what's being sent over the wire (aside from, its a big file. Not sure if streamed or pre-chunked. Metadata format, etc)

I assume you want this merged without waiting for us? Or is it much preferred to hammer out something on our side and send it in here

Comment on lines +445 to +448

message LogMessageResponse {
// Empty at the moment, defined for possibility of adding future return values.
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for forward change.

for {
expected, err := checkinClient.Recv()
if err != nil {
if err != io.EOF {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would fail the linter.

Suggested change
if err != io.EOF {
if !errors.Is(err, io.EOF) {

return
}

t := time.NewTicker(c.minCheckTimeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on our morning discussion this would behavior weirdlky with sleep or hibernate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at the golang I do not believe this to be an issue. The ticker will keep ticking, on hibernate it will just pickup where it left off because it uses the monoclock. The ticker also handles a slow reader so it never adds more than one to the channel so until the next is removed from the channel it will not add another one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the verification!

defer checkinWrite.Done()

if err := c.sendObserved(checkinClient); err != nil {
if err != io.EOF {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if err != io.EOF {
if !errors.Is(err, io.EOF) {

}
}
}
c.units = c.units[:i]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've almost missed the resize.

Token: c.token,
Id: action.Id,
Status: proto.ActionResponse_FAILED,
Result: ActionTypeUnknown,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Result: ActionErrUnitNotFound,
}
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this mean that If an agent receive an osquery and it receive an osquery action and the unit isn't runnning we would silently ignore the event?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, not we would report back upstream, ok I think its fine.

Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few questions in the PR, and so far it looks good to me.
I've been having a few difficulties concerning the Keystore API and implementation, but I find it hard without playing in the context of the agent.

Looking at other comments concerning renaming txn to tx I believe is a good thing to do.

Result: ActionErrUnmarshableParams,
}
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this, +1 don't trust anything.

go func() {
time.Sleep(100 * time.Millisecond)
change.Unit.UpdateState(UnitStateStopped, "Stopped", nil)
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the "Healthy", "Stopping", "Stopped" and other a public constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For testing make it constant? For the library, no I would prefer they provide more detail than that. Like "Connected to elasticsearch(https://localhost:8200)" or something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK good point.

defer m.Unlock()

if !connected {
return fmt.Errorf("server never recieved valid token")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("server never recieved valid token")
return fmt.Errorf("server never received a valid token")

units = append(units, change.Unit)
unitsLock.Unlock()
default:
panic("not implemented")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use testing's t.Fatal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its in a seperate goroutine, go vet will not allow this.

call to (*T).Fatal from a non-test goroutine

defer m.Unlock()

if !gotInit {
return fmt.Errorf("server never recieved valid token")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("server never recieved valid token")
return fmt.Errorf("server never received valid token")

"github.com/elastic/elastic-agent-client/v7/pkg/utils"
)

// CheckinMinimumTimeout is the amount of time the client must send a new checkin even if the status has not changed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above discussion concerning hibernation and sleep.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an issue with the usage of ticker.

// where the action type is unknown.
var ActionTypeUnknown = utils.JSONMustMarshal(map[string]string{
"error": "action type unknown",
})
Copy link
Contributor

@ph ph May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably have a concrete type implementing MarshalJSON instead of creating JSON fragment, something like:

ActionTypeErr{ error: "action-type unknown" }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

unitsLock.Unlock()
default:
panic("not implemented")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Fatal

s.store = append(s.store, unitStore)
}
return unitStore
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we have a full in-memory store :D


expLock sync.RWMutex
exp UnitState
config string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be the string representation of the yaml, this seems asking for having his own type? We we can validate the presence of some fields depending on the unit or that we have valid yaml.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly we can have a yaml.RawMessage because of the way indent works and could break the configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it YAML, JSON, or something else. The point of the client is that it doesn't know and doesn't care. So its best to keep it just as a string representation that could then be decoded.

@cmacknz cmacknz requested a review from andrewkroh May 18, 2022 19:58
@ph ph requested a review from cmacknz May 19, 2022 13:23
@ph
Copy link
Contributor

ph commented May 19, 2022

@cmacknz Can you review this too, this directly impact the data plane?

@blakerouse
Copy link
Contributor Author

@aleksmaus @ph Made the requested changes.

Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Let's get another approval from @elastic/elastic-agent-data-plane

@@ -25,6 +41,8 @@ message ConnInfo {
bytes peer_cert = 5;
// Peer private key.
bytes peer_key = 6;
// Allowed services that spawned process can use. (only used in V2)
repeated ConnInfoServices services = 7;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is it expected that processes do with the services list? Is the intent that each process checks to make sure the services it needs are actually available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. A process can only use the services that it is given, if it tries to Elastic Agent will return an error back to the caller.

}

// Errors returns channel of errors that occurred during communication.
func (c *clientV2) Errors() <-chan error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document that failing to read Errors() or UnitChanges() will cause the client to block internally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is documented on the interface.

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I expect we will have more feedback once we actually try to use the new V2 client :)

@ph
Copy link
Contributor

ph commented May 19, 2022

@cmacknz I also change when we try the real world.

@fearful-symmetry
Copy link
Contributor

Agreed on real-world testing, I'm honestly kinda planning with the assumption that a lot of the APIs are gonna change over time as we actually use it.

@blakerouse blakerouse merged commit 43bacbe into elastic:main May 24, 2022
@blakerouse blakerouse deleted the v2-protocol branch May 24, 2022 13:19
v1v pushed a commit to v1v/elastic-agent-client that referenced this pull request Sep 5, 2022
* Move httpcommon from libbeat

* updates to satisfy linter
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Design] Elastic Agent Control Protocol v2 gRPC.