Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-23000] Add log bundler to collect and compress logs #16166

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rmetzger
Copy link
Contributor

@rmetzger rmetzger commented Jun 15, 2021

What is the purpose of the change

Users often struggle to provide all relevant logs. By having a button in the Web UI that collects the logs of all TaskManagers and the JobManager will make things easier for the community supporting users.

Screenshots:

image
image
image

Verifying this change

  • I manually tested this locally, I'll also test it on a distributed cluster with larger log files
  • There's an integration test to test the happy path
  • There are a few tests for error cases

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Comment on lines 35 to 43
triggerBundle() {
var res = this.httpClient.get(`${BASE_URL}/logbundler?action=trigger`).pipe(catchError(() => {
return EMPTY
}));
// TODO this seems to be needed to trigger the call?
res.toPromise().then(() => {
console.log("fulfilled")
})
return res;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Airblader What do I have to do to do a "fire and forget" HTTP call here?

Could you also take a look at the other front-end related changes? Thanks a lot!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look at the rest, but to answer the question quickly:

this.http.get(...).subscribe();

Observables are lazy, you need to subscribe to them for it to fire. The toPromise hack worked because promises are eager and thus the conversion triggers it.

@flinkbot
Copy link
Collaborator

flinkbot commented Jun 15, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 436ccf1 (Sat Aug 28 13:07:42 UTC 2021)

Warnings:

  • 1 pom.xml files were touched: Check for build and licensing issues.
  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@rmetzger
Copy link
Contributor Author

rmetzger commented Jun 15, 2021

Open TODOs:

  • Test on minikube with large log files (to make sure there are no locks, or any other issues with remote file transfers / large files)
  • Add a few more tests for the LogArchiver
  • Regenerate the REST API doc?
  • Address CI failures (I made some changes to some shared infra)

@rmetzger
Copy link
Contributor Author

@zentol Could you take a look at this PR?

@knaufk
Copy link
Contributor

knaufk commented Jun 15, 2021

Are there any requirements w.r.t to the log configuration for this feature to work?

@flinkbot
Copy link
Collaborator

flinkbot commented Jun 15, 2021

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@rmetzger
Copy link
Contributor Author

The logs need to be accessible in the Flink UI, since its using the same mechanism for fetching the logs.
Logs are accessible from the UI if the taskmanager.log.path configuration parameter is set, or the log.file system property. We'll fetch the file stored in that location.
I believe this works with all default deployments, where the default log4j2 config / deployment scripts/code takes care of configuring this correctly.

Copy link
Contributor

@Airblader Airblader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I've exclusively reviewed the frontend changes)

@@ -49,6 +49,9 @@ <h1>Apache Flink Dashboard</h1></a>
<span><i nz-icon type="upload"></i><span>Submit New Job</span></span>
</li>
</ul>
<nz-affix *ngIf="!collapsed" style="position: absolute; bottom: 15px; left:15px" id="affix-container-target">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline styling is a bad practice, the minimum should be to use a CSS class instead and move the styles into app.component.less.

Absolute positioning is kind of an art form. It can be the fitting solution in some cases, but most of the time it's an escape hatch that just causes problems (for example when collapsing the menu, as you noticed), because it takes the element entirely out of the document flow. If the intent is simply to have this on the bottom of the menu, an easier solution would be

margin: auto 15px 15px 15px;

I haven't verified this here, but it should work (we use the same thing in Ververica Platform).

(The final nit would be that I'm guessing the value of 15px was more or less chosen at random, AntD actually uses a different progression for spacing, so technically 12px or 16px would be in line with the design language.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The margin positioning didn't work:
image

(But let's maybe address this once we have agreement where to place the download button in general)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But let's maybe address this once we have agreement where to place the download button in general

Sounds good. If it does stay here, I'd also not hard-oppose the absolute positioning, but alternatively we could also do this in pairing.

*/

export interface LogsBundlerStatus {
status: string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines should both end with ;.

@@ -83,7 +84,7 @@ export function AppInitServiceFactory(

@NgModule({
declarations: [AppComponent],
imports: [BrowserModule, AppRoutingModule, NgZorroAntdModule, FormsModule, HttpClientModule, BrowserAnimationsModule],
imports: [BrowserModule, AppRoutingModule, NgZorroAntdModule, FormsModule, HttpClientModule, BrowserAnimationsModule, ShareModule],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be improperly formatted, both the import and this line here. If the tooling doesn't complain then I guess we have no tooling for the UI for this, in which case feel free to disregard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the formatting, but no tooling complains here. Would be nice to use an auto formatter for the project, in particular for noobies like me.

Comment on lines 35 to 37
triggerBundle() {
var res = this.httpClient.get(`${BASE_URL}/logbundler?action=trigger`).pipe(catchError(() => {
return EMPTY
}));
// TODO this seems to be needed to trigger the call?
res.toPromise().then(() => {
console.log("fulfilled")
})
return res;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
triggerBundle() {
var res = this.httpClient.get(`${BASE_URL}/logbundler?action=trigger`).pipe(catchError(() => {
return EMPTY
}));
// TODO this seems to be needed to trigger the call?
res.toPromise().then(() => {
console.log("fulfilled")
})
return res;
}
triggerBundle() {
return this.httpClient.get(`${BASE_URL}/logbundler?action=trigger`).pipe(
catchError(() => EMPTY)
);
}

The subscribe should not actually happen here, but rather at the caller site anyway. Otherwise this is very surprising behavior. Either you fire-and-forget, meaning you don't return anything, or you return the observable and let the caller manage the subscription. Since this is a service which acts as a facade to the API, I would always opt for returning the observable either way so the caller can choose whether and what to do with the response.

-->

<button nz-button nzType="primary" (click)="requestArchive()">Request New Log Archive</button>
<button nz-button nzType="primary" [hidden]="hideDownloadButton" (click)="downloadArchive()" style="margin-top:5px">Download Log Archive</button>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use inline CSS for the styling (and use 4px?), also in the lines below.

*/

export interface LogsBundlerStatus {
status: string,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming the possible values for status are known, so we could be more precise than string here:

status: "PROCESSING" | "BUNDLE_READY" | "… possible other ones…";

With string literal types we get completion and better type-safety for where you compare the status against these values.

this.hideSpinner = true;
this.hideDownloadButton = true;

if(status.status == "PROCESSING") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JavaScript, using == is a deadly sin these days, and we use === instead. :-)

Comment on lines 55 to 63
this.hideSpinner = true;
this.hideDownloadButton = true;

if(status.status == "PROCESSING") {
this.hideSpinner = false;
}
if (status.status == "BUNDLE_READY") {
this.hideDownloadButton = false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.hideSpinner = true;
this.hideDownloadButton = true;
if(status.status == "PROCESSING") {
this.hideSpinner = false;
}
if (status.status == "BUNDLE_READY") {
this.hideDownloadButton = false;
}
this.hideSpinner = (status.status !== "PROCESSING");
this.hideDownloadButton = (status.status !== "BUNDLE_READY");

?

this.hideSpinner = false;
this.hideDownloadButton = true;
}
downloadArchive() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think this method belongs into the service since it builds the specific path of the API, which is generally one of the things you want to encapsulate in those services so that components don't need to be aware of the API.

*/
triggerBundle() {
var res = this.httpClient.get(`${BASE_URL}/logbundler?action=trigger`).pipe(catchError(() => {
return EMPTY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mapping errors to EMPTY means this observable will emit nothing (and just complete) in a failure case. I'll have to read on to see if this is problematic.

Edit from my future self: It means that if the API starts failing, the UI will simply not update anymore. Probably not what we'd want, so a nicer solution might be to map to a "fake" response, or let the error propagate and handle it at the caller site.

This is better than doing nothing at all, at least, since it prevents the error from clogging up the console. But it would leave the user in the dark. I'll leave it to you whether this is something we want to address or not given that this is probably the status quo of how things are done in Flink UI and thus this doesn't make it worse.

@Airblader
Copy link
Contributor

On a very high level, putting this feature (and not just a button, but two + spinner + message) into the navigation menu feels a bit cluttered to me, personally. It makes this a super important thing users will be looking at at all times when they're in the UI. I'd personally probably put this into one of the pages rather than the side menu, but I might be unaware of considerations, discussions and decisions to place it there.

changeDetection: ChangeDetectionStrategy.OnPush
})
export class LogsBundlerComponent implements OnInit{
destroy$ = new Subject();
Copy link
Contributor

@Airblader Airblader Jun 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed this isn't actually ever "fed", which means this component currently leaks subscriptions. You need this component to implement OnDestroy and then add

ngOnDestroy() {
  this.destroy$.next();
  this.destroy.complete();
}

@rmetzger
Copy link
Contributor Author

Thanks a lot for your extensive feedback, it was very helpful!

On a very high level, putting this feature (and not just a button, but two + spinner + message) into the navigation menu feels a bit cluttered to me, personally.

I'm also not fully convinced that the current location makes a lot of sense. The other option would be the "Overview" page.
Let's see what the other reviewers think.

*/

export interface LogsBundlerStatus {
status: "IDLE" | "PROCESSING" | "BUNDLE_READY" | "BUNDLE_FAILED"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing ; still

this.hideDownloadButton = (status.status !== "BUNDLE_READY");
this.cdr.markForCheck();
}, error => {
this.message = "Error while fetching status: " + error.message;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're going to want to this.cdr.markForCheck(); here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true true, need to pay more attention 🙈

Copy link
Contributor

@Airblader Airblader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UI changes look good to me (% formatting, but this is a job for tooling)

internalEntryName = getNextEntryName(internalEntryName);
}
entryNames.add(internalEntryName);
// ArchiveEntry entry = archiveOutputStream.createArchiveEntry(file, internalEntryName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented out code?

@rmetzger
Copy link
Contributor Author

Thanks a lot for your approval.

I would like to revisit the location and prominence of the download mechanism. Instead of having it in the sidebar, how about this (cc @knaufk)

image

@Airblader
Copy link
Contributor

+1 to that proposal. I think the side menu is both too prominent and also not the place for such actions at all.

@Airblader
Copy link
Contributor

@rmetzger It looks like Konstantin agrees with the proposal (not sure you saw it).

@rmetzger
Copy link
Contributor Author

rmetzger commented Jul 9, 2021

Yes, I saw his reaction. I'll pick up this PR again after my vacation.

@rmetzger
Copy link
Contributor Author

@zentol Could you review this PR?

@zentol zentol self-assigned this Aug 2, 2021
Comment on lines +88 to +89
// copy with limit: the input log file could have additional data, which would confuse
// the archiver
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was indeed sloppy. I added a more extensive one to the method:

    /**
     * Special copy method which defines a "reading window" on the input stream. The
     * "TarArchiveOutputStream" expects an TarArchiveEntry with a defined size. If the log file is
     * growing while the entry is added to the output stream, it will throw an exception. This
     * method ensures that only "limit" bytes will be written from the "input" to the "output"
     * stream.
     */

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only case where the file can grow are the JM logs, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK then the limit makes sense.
But it does make me wonder how well this approach works when the log files roll over; the contents may suddenly change, as does the file size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(How well does the log transmission from the TM handle rollovers?)

entry.setSize(size);
entry.setMode(TarArchiveEntry.DEFAULT_FILE_MODE);
entry.setModTime(modTime);
entry.setUserName("");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment that is done because it defaults to the user.name system property.

What does the userName actually do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

@Test
public void testCopy_limited2() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this test for?


// status of the bundler
@GuardedBy("statusLock")
private volatile Status status = Status.IDLE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this handler implemented with a custom state machine, instead of as async handler like the savepoint handler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I wasn't aware of the async handler. I'll try to use it.

Comment on lines +126 to +128
this.message = "No tmp directory available for log bundler";
log.error(message);
this.tmpDir = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to handle this case. If no tmpDir was configured the cluster won't even start up.

}
synchronized (statusLock) {
final String action = queryParams.get(0);
if ("download".equals(action)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move download/trigger into an enum that is also used in the headers.

private void addTaskManagerLogFile(TaskManagerLogAndId logFile) {
checkNotNull(logArchiver, "Assuming log archiver to be set");
try {
logArchiver.addArchiveEntry(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should run in a separate executor to ensure the REST API remains responsive.

return resourceManagerGateway
.requestTaskManagerFileUploadByType(
taskManager.getResourceId(), FileType.LOG, rpcTimeout)
.thenApplyAsync(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is an async operation such that subsequent steps run in the executor? If so, then I'd prefer making this explicitly when we setup the addTaskManagerLogFile call.

}

private void collectAndCompressLogs() {
synchronized (statusLock) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This locking kind of voids the benefit of using an executor, because as long as this operation is running any request to the handler results in a blocked thread.

extends AbstractHandler<RestfulGateway, EmptyRequestBody, LogBundlerMessageParameters> {

// set high timeout for uploading files to the blob store
private final Time rpcTimeout = Time.minutes(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be based in some form on the configured rpc timeout.

@rmetzger
Copy link
Contributor Author

FYI: I was hoping to find some time to address the last outstanding comments before my parental leave, but I won't be able to.
Maybe I find time in December to look into this again.
If somebody else wants to take over this work, they are more than welcome to do so. I'm happy to occasionally respond to questions.

@MartijnVisser
Copy link
Contributor

@rmetzger Do you still plan to complete this feature?

@rmetzger
Copy link
Contributor Author

I like the feature, but I currently lack the time to complete it. Afaik there's some substantial effort required to address all remaining PR comments.

@zentol zentol removed their assignment Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants