Add startup logs to build log UI #2957

Closed · Tracked by #4978
ammario opened this issue Jul 13, 2022 · 19 comments

@ammario (Member) commented Jul 13, 2022

I have a complex startup script:

resource "coder_agent" "dev" {
  arch           = "amd64"
  os             = "linux"
  startup_script = <<EOF
    #!/bin/sh
    # install and start code-server
    curl -fsSL https://code-server.dev/install.sh | sh
    code-server --auth none --port 13337
    sudo service docker start
    if [ -f ~/personalize ]; then ~/personalize 2>&1 | tee  ~/.personalize.log; fi
    EOF
}

And it's not clear to me where I could see its log output. It would also be nice to not have to redirect command execution to logs manually.

The natural place for me to find this is the build log on the dashboard, but I also expect some file in my workspace to contain the log.

While we're at it, we should include the agent logs (/var/log/coder-agent.log) in the dashboard build log.

@ammario ammario added the feature Something we don't have yet label Jul 13, 2022
@ammario ammario changed the title Add startup script log to build log Add startup logs to build log Jul 31, 2022
@ammario (Member, Author) commented Jul 31, 2022

So, the log is in /var/log/coder-startup-script.log. I had to read the documentation to find that, so the issue still stands as a major usability improvement.

@ketang's comment was marked as off-topic.

@ammario ammario changed the title Add startup logs to build log Add startup logs to build log UI Jul 31, 2022
@ammario ammario added this to the EE milestone Aug 22, 2022
@sreya sreya self-assigned this Sep 8, 2022
@ammario ammario assigned ammario and unassigned sreya and ammario Sep 26, 2022
@ammario (Member, Author) commented Oct 17, 2022

Let's make sure we use a PrefixSuffixWriter (First N bytes + Last N bytes) to bound how much storage these logs can take. We also may need the same for the provisioner logs, but I haven't investigated the code.
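
A minimal Go sketch of what such a writer could look like (hypothetical code for illustration; `prefixSuffixWriter` is not an existing type in the repo):

```go
package agentlog

import "io"

// prefixSuffixWriter keeps the first n bytes verbatim and the last n
// bytes in a ring buffer, discarding the middle. Total retention is
// bounded at 2n no matter how much the startup script prints.
type prefixSuffixWriter struct {
	n      int
	prefix []byte // first n bytes written
	suffix []byte // ring buffer holding the most recent n bytes
	pos    int    // next overwrite position once the ring is full
}

var _ io.Writer = (*prefixSuffixWriter)(nil)

func (w *prefixSuffixWriter) Write(p []byte) (int, error) {
	total := len(p)
	// Fill the prefix first.
	if len(w.prefix) < w.n {
		take := w.n - len(w.prefix)
		if take > len(p) {
			take = len(p)
		}
		w.prefix = append(w.prefix, p[:take]...)
		p = p[take:]
	}
	// Everything else cycles through the suffix ring buffer.
	for _, b := range p {
		if len(w.suffix) < w.n {
			w.suffix = append(w.suffix, b)
		} else {
			w.suffix[w.pos] = b
			w.pos = (w.pos + 1) % w.n
		}
	}
	return total, nil
}

// Bytes returns prefix + suffix, unrolling the ring buffer so the
// suffix comes out oldest-first.
func (w *prefixSuffixWriter) Bytes() []byte {
	out := append([]byte(nil), w.prefix...)
	out = append(out, w.suffix[w.pos:]...)
	return append(out, w.suffix[:w.pos]...)
}
```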

@ammario (Member, Author) commented Oct 18, 2022

An implementation challenge is that startup_script can block for an indeterminate amount of time. Some of our examples spin off code-server asynchronously, but others block.

@sreya (Collaborator) commented Oct 19, 2022

> Let's make sure we use a PrefixSuffixWriter (First N bytes + Last N bytes) to bound how much storage these logs can take. We also may need the same for the provisioner logs, but I haven't investigated the code.

The only issue is that if we're expecting some sort of prompt to be written to the log (something I know @kylecarbs mentioned when we discussed this on Monday), it might get truncated by the PrefixSuffixWriter.

We could potentially only keep the startup logs of the most recent build and maybe also use a PrefixSuffixWriter with a 1MB cap to prevent any maliciousness. That would more than likely solve 99% of cases and prevent us from ballooning over time.

The startup logs of previous builds are not likely to be useful.

@ammario (Member, Author) commented Oct 20, 2022

Some kind of limiter is absolutely necessary to prevent DoS, e.g. cat /dev/urandom.

> We could potentially only keep the startup logs of the most recent build and maybe also use a PrefixSuffixWriter with a 1MB cap to prevent any maliciousness. That would more than likely solve 99% of cases and prevent us from ballooning over time.

This sounds good to me.

@github-actions (bot) commented Jan 4, 2023

This issue is becoming stale. In order to keep the tracker readable and actionable, I'm going to close this issue in 7 days if there isn't more activity.

@github-actions github-actions bot added the stale This issue is like stale bread. label Jan 4, 2023
@BrunoQuaresma BrunoQuaresma self-assigned this Jan 5, 2023
@BrunoQuaresma (Collaborator)

Since startup scripts can run on multiple agents, I think we can add some sort of tabs for the logs.

[Screenshot: Screen Shot 2023-01-05 at 16 08 49]

@coadler coadler removed the stale This issue is like stale bread. label Jan 5, 2023
@kylecarbs (Member)

What would the workspace status be if startup scripts are still running?

@ammario (Member, Author) commented Jan 6, 2023

I think the MVP here should just be putting the logs in the UI, and then in a follow-up feature we can address waiting for startup scripts to finish. I believe some templates have startup scripts that block, e.g. they run code-server. Thus, a special workspace status for startup scripts could be a breaking/confusing change for existing deployments.

@BrunoQuaresma (Collaborator)

This is a good question. I don't know; I think it could be either running or pending. However, I don't think this influences showing the startup logs on the build page too much.

@mafredri (Member)

The discussion in #4677 is relevant to workspace status during startup (/shutdown). As for blocking in the startup script, I think that should be avoided/discouraged; we could introduce something new, like coder_agent_service, to start programs. That would allow more complex functionality, like restarting services after X failed health checks and whatnot. (Perhaps a topic for a separate issue, though.)

I agree with @ammario that we should limit this scope to simply storing/streaming the log. I suggest we apply compression (e.g. zstd via klauspost/compress) to the logs, but do we apply it pre- or post-limiting the output to 1MB? On the one hand, compressing before limiting helps ensure the output usually isn't truncated (in failure scenarios there could be a lot of output to sift through). On the other hand, compression plus a for loop echoing the same thing could produce an enormous uncompressed log.
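
As a sketch of the compression step under that assumption (klauspost/compress zstd; `compressLog` is a hypothetical helper, and it deliberately punts on the pre- vs post-limit question):

```go
package logstore

import (
	"bytes"

	"github.com/klauspost/compress/zstd"
)

// compressLog compresses a fully captured log with zstd before it is
// persisted. Limiting the raw log to 1MB *before* this call bounds
// what any reader can ever receive; limiting the *compressed* output
// instead fits far more log under the cap, but a loop echoing the
// same line could then expand to an enormous uncompressed log.
func compressLog(log []byte) ([]byte, error) {
	var buf bytes.Buffer
	enc, err := zstd.NewWriter(&buf)
	if err != nil {
		return nil, err
	}
	if _, err := enc.Write(log); err != nil {
		_ = enc.Close()
		return nil, err
	}
	if err := enc.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```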

@mafredri mafredri self-assigned this Jan 10, 2023
@ammario (Member, Author) commented Jan 10, 2023

If we compress the log on capture, wouldn't that make it difficult / impossible to stream the log piecewise in the future?

@mafredri (Member)

@ammario I understand your concern. I think it's doable, but we can punt it for now. It'll add some complexity and raises the question of whether we should just use Postgres compression instead (at the storage layer). My main thought in suggesting it was to allow larger logs to be stored, but reduced bandwidth could be a nice bonus.

On the topic, it's possible to use zstd as a transparent compression layer via SaveTheRbtz/zstd-seekable-format-go (I've only tested that it works; no other experience with seekable). Assuming we can trust that it's one frame per line (the observed behavior), this would map pretty well to inserts in the database.
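
To make the frame-per-line mapping concrete without the seekable wrapper itself, here's a rough sketch using plain klauspost zstd, where `insertLogRow` is a hypothetical stand-in for the database insert:

```go
package logstore

import "github.com/klauspost/compress/zstd"

// insertLogRow is a hypothetical stand-in for inserting one
// compressed log line into the database.
func insertLogRow(buildID string, frame []byte) { /* ... */ }

// storeLines encodes each log line as an independent zstd frame, so
// every database row is decodable on its own and arbitrary ranges of
// rows can be served without touching the rest of the log.
func storeLines(buildID string, lines []string) error {
	enc, err := zstd.NewWriter(nil) // nil writer: we only use EncodeAll
	if err != nil {
		return err
	}
	defer enc.Close()
	for _, line := range lines {
		frame := enc.EncodeAll([]byte(line), nil) // one frame per line
		insertLogRow(buildID, frame)
	}
	return nil
}

// readLine decompresses a single stored frame independently.
func readLine(frame []byte) ([]byte, error) {
	dec, err := zstd.NewReader(nil) // nil reader: we only use DecodeAll
	if err != nil {
		return nil, err
	}
	defer dec.Close()
	return dec.DecodeAll(frame, nil)
}
```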

One simpler solution we could employ is to compact logs once they're done streaming and store them as a single bytea blob. Unfortunately, lib/pq would require us to fetch the entire blob, so we can't do streaming decompression for memory savings.

We could also just do compaction for the "historical" jobs, which might allow us to keep them around. I personally think they'd be useful for debugging by both admins and users. (I.e. "what's different today? This worked yesterday!")

@ammario (Member, Author) commented Jan 11, 2023

Right now, logs older than 30 days are deleted automatically, so I don't think we have to worry about storage. An upper-bounded workspace would generate at most ~30MB of logs per month (1MB per build, roughly one build a day over the 30-day retention window).

And good points on the lack of seekability in the pq driver. That is unfortunate.

@BrunoQuaresma (Collaborator)

Removing myself from this until the backend work is done. It makes it easier for me to see which issues I should focus on.

@mafredri (Member) commented Feb 1, 2023

@ammario I've got a few questions for implementing this functionality.

(One thing worth noting is that "startup scripts that end" has been implemented and our examples adjusted; see #5749.)

  1. Do we want a push or pull architecture, or both?
  2. Regarding the proposed PrefixSuffixWriter, do we send the tail once the startup script has ended, or do we try to stream the tail continuously?

To expand on 1.

Push architecture pros/cons:

  • + Agent logs will always be available, even if the workspace is shut down
  • - N number of agents will simultaneously be sending logs to the server
  • - Server needs to monitor load and tell agents to back off if it gets too rough
  • - Causes load even when logs are not needed
  • - If we want to stream the "tail" part of a "limited" log, we have to either send & delete or wait until e.g. startup script has completed

Pull architecture pros/cons:

  • + No additional load except when someone is viewing the logs (e.g. workspace build logs, trying to SSH in before workspace is ready, etc.)
  • + Since we're always streaming logs we have more freedom with how we handle limits and streaming the tail part of the log
  • + We can do seeking to reveal any part of the log, if needed
  • - Logs are ephemeral and not available after workspace stop
  • + Can be extended to scrape logs by server at certain points

If we go with push, do we go with (d)RPC to reduce load?

If we go with pull, would this be a suitable addition to the "statistics server" on the agent, @kylecarbs? We'd probably rename it, though. The reason I ask is that I worry about allocating too many listen ports; we've not reserved that many.

And re: 2., assuming we go with push, I imagine an implementation where we allow 1MB (minus 64KB) to be streamed and stored straight off, and the remaining "tail" is kept in a circular buffer of, say, 64KB and committed once the startup script exits (see the sketch below). Does this match what you had in mind, @ammario?

The downside of this is that the user will not see what's currently happening in the log (an edge case, since they'd have exceeded the ~1MB limit).

The alternative of streaming the tail part doesn't really prevent DoS.
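
A rough sketch of that push-side behavior, reusing the ring-buffer idea from the PrefixSuffixWriter sketch earlier in the thread (`sink` is a hypothetical stand-in for whatever ships log bytes to coderd):

```go
package agentlog

import "io"

// streamTailWriter streams the first `budget` bytes directly to sink
// so viewers see them live, then diverts the overflow into a small
// ring buffer. Close commits whatever tail survived once the startup
// script exits, matching the "1MB minus 64KB streamed, 64KB tail
// committed at exit" idea above.
type streamTailWriter struct {
	sink   io.Writer
	budget int    // bytes still allowed to stream directly, e.g. 1MB-64KB
	size   int    // tail capacity, e.g. 64KB
	tail   []byte // ring buffer for the final bytes
	pos    int
}

func newStreamTailWriter(sink io.Writer, budget, tailSize int) *streamTailWriter {
	return &streamTailWriter{sink: sink, budget: budget, size: tailSize}
}

func (w *streamTailWriter) Write(p []byte) (int, error) {
	n := len(p)
	if w.budget > 0 { // stream while the budget lasts
		direct := w.budget
		if direct > len(p) {
			direct = len(p)
		}
		if _, err := w.sink.Write(p[:direct]); err != nil {
			return 0, err
		}
		w.budget -= direct
		p = p[direct:]
	}
	for _, b := range p { // overflow cycles through the tail ring
		if len(w.tail) < w.size {
			w.tail = append(w.tail, b)
		} else {
			w.tail[w.pos] = b
			w.pos = (w.pos + 1) % w.size
		}
	}
	return n, nil
}

// Close flushes the buffered tail, oldest bytes first.
func (w *streamTailWriter) Close() error {
	if _, err := w.sink.Write(w.tail[w.pos:]); err != nil {
		return err
	}
	_, err := w.sink.Write(w.tail[:w.pos])
	return err
}
```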


As for /tmp/coder-agent.log, we might want to break that out into a separate issue. If we go with push, we need to decide which parts of the logs are sent/kept. For instance, we could consider "all logs from start until workspace lifecycle = ready" to be included in the build logs. Likewise, once we add shutdown, all logs from lifecycle = shutting_down -> stopped could also be added to the build log.

None of that would allow inspecting the current log, though; that'd still fall under the pull category. As for why we need to break it up into parts: my workspace log after running for over a day is 3.8MB.

@ammario (Member, Author) commented Feb 1, 2023

I think the pull architecture is the way to go:

  • We can avoid PrefixSuffixWriter, always giving the user the full log
  • We can seek, efficiently serving massive logs to the frontend
  • The code is significantly simpler

I also think we should only have a single listen server on the agent, as is typical of most daemons.
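
A sketch of what the agent side of pull could look like, leaning on net/http's built-in Range handling for the seeking part (the route and listen address are hypothetical; the log path is the one mentioned earlier in the thread):

```go
package main

import (
	"net/http"
	"os"
	"time"
)

func main() {
	// Hypothetical endpoint on the agent's single listen server.
	http.HandleFunc("/startup-log", func(w http.ResponseWriter, r *http.Request) {
		f, err := os.Open("/var/log/coder-startup-script.log")
		if err != nil {
			http.Error(w, err.Error(), http.StatusNotFound)
			return
		}
		defer f.Close()
		// ServeContent honors Range requests, so the frontend can
		// seek into an arbitrarily large log without the agent ever
		// sending (or buffering) the whole file.
		http.ServeContent(w, r, "coder-startup-script.log", time.Time{}, f)
	})
	_ = http.ListenAndServe("127.0.0.1:13338", nil) // hypothetical port
}
```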

@code-asher code-asher self-assigned this Feb 27, 2023
@kylecarbs kylecarbs assigned kylecarbs and unassigned mafredri and code-asher Mar 14, 2023
@bpmct (Member) commented Mar 28, 2023

Closing as completed by #6558.

@bpmct bpmct closed this as completed Mar 28, 2023