[SPARK-32529][CORE] Fix Historyserver log scan aborted by application status change #29350

Closed

yanxiaole
Contributor

What changes were proposed in this pull request?

This PR adds a try/catch block for FileNotFoundException around the code that adds a new entry to the history server application listing, so that paths that no longer exist are skipped.

Why are the changes needed?

If the log dir contains a large number (>100k) of application logs, listing it takes a few seconds. By the time the path list has been returned, some applications may already have finished, and their filenames will have changed from foo.inprogress to foo.

This causes a problem when adding an entry to the listing: querying file status, for example via fileSizeForLastIndex, throws a FileNotFoundException if the application has finished in the meantime. The exception aborts the current loop, so on a busy cluster the history server cannot list or load any application log.

20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event log updates
 java.io.FileNotFoundException: File does not exist: hdfs://xx/logs/spark/application_11111111111111.lz4.inprogress
 at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527)
 at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520)
 at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520)
 at org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170)
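
The race described above can be reproduced in miniature. The following is only a sketch under stated assumptions: it uses the local filesystem instead of HDFS, and the names `ScanRaceSketch`, `sizeOf`, `scan` and `demo` are hypothetical and are not taken from `FsHistoryProvider`. It takes a listing snapshot, renames one file afterwards, and skips the entry whose status query then fails.

```scala
import java.io.{File, FileInputStream, FileNotFoundException}
import java.nio.file.Files

object ScanRaceSketch {
  // Hypothetical stand-in for a file status query. Opening a missing local
  // file throws FileNotFoundException, similar to the HDFS call in the trace.
  def sizeOf(f: File): Long = {
    val in = new FileInputStream(f)
    try in.getChannel.size() finally in.close()
  }

  // Keep only the entries whose status can still be read; an entry that
  // vanished after the listing was taken is skipped instead of aborting the scan.
  def scan(listed: Seq[File]): Seq[(String, Long)] =
    listed.flatMap { f =>
      try Some(f.getName -> sizeOf(f))
      catch { case _: FileNotFoundException => None }
    }

  def demo(): Seq[(String, Long)] = {
    val dir = Files.createTempDirectory("eventlogs").toFile
    val a = new File(dir, "app_1.inprogress")
    val b = new File(dir, "app_2.inprogress")
    Files.write(a.toPath, "x".getBytes("UTF-8"))
    Files.write(b.toPath, "yy".getBytes("UTF-8"))
    // Snapshot of the directory listing, taken before the rename below.
    val listed = dir.listFiles().toSeq.sortBy(_.getName)
    // app_1 "finishes" after the listing: its file is renamed.
    a.renameTo(new File(dir, "app_1"))
    scan(listed)
  }
}
```

In this sketch `ScanRaceSketch.demo()` should return only the `app_2.inprogress` entry, because the stale `app_1.inprogress` path is skipped rather than ending the scan.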

Does this PR introduce any user-facing change?

No

How was this patch tested?

  1. Set up a separate script that keeps renaming application files under the history log dir.
  2. Launch the history server.
  3. Check that the File does not exist error no longer appears in the log.
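
The renaming script from step 1 is not included in this PR. A minimal sketch of what one pass of such a script could look like, assuming a local directory rather than HDFS and using the hypothetical names `RenameChurnSketch`, `toggle` and `churnOnce`:

```scala
import java.io.File

object RenameChurnSketch {
  // Flip one file between "<name>" and "<name>.inprogress".
  def toggle(f: File): File = {
    val name = f.getName
    val next =
      if (name.endsWith(".inprogress")) name.stripSuffix(".inprogress")
      else name + ".inprogress"
    val target = new File(f.getParentFile, next)
    // If the rename fails, report the file under its old name.
    if (f.renameTo(target)) target else f
  }

  // One pass over the directory; a real script would repeat this in a loop
  // while the history server is scanning.
  def churnOnce(dir: File): Seq[String] = {
    val files = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    files.filter(_.isFile).map(toggle).map(_.getName).sorted
  }
}
```

Calling `churnOnce` repeatedly makes every listed path go stale between the listing and the status query, which is the condition the manual test relies on.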

@HeartSaVioR
Contributor

ok to test

} catch {
case _: FileNotFoundException => false
}
case _: FileNotFoundException =>
Contributor

nit: I'd add an empty line after }.

Contributor Author

added

@SparkQA

SparkQA commented Aug 5, 2020

Test build #127080 has finished for PR 29350 at commit d1bf4ca.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 5, 2020

Test build #127079 has finished for PR 29350 at commit d113709.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

retest this, please

Contributor

@HeartSaVioR HeartSaVioR left a comment

I think it's non-trivial to write a test for this behavior, so it's OK to go without one.

We probably also want to log and swallow the exception per entry, so that an exception from one entry doesn't affect the others. That's just an idea and doesn't necessarily need to be applied in this PR.

@HeartSaVioR
Contributor

cc. @vanzin @gengliangwang

@yanxiaole
Contributor Author

yanxiaole commented Aug 5, 2020

We probably also want to log and swallow the exception per entry, so that an exception from one entry doesn't affect the others.

Yes, I think it would be better.

If this is the idea we are going to apply, I can submit a PR about it later. Probably start from a general exception and log it? What do you think? @HeartSaVioR

@SparkQA

SparkQA commented Aug 5, 2020

Test build #127091 has finished for PR 29350 at commit d1bf4ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you for your first contribution, @yanxiaole .

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. I also agree with @HeartSaVioR 's assessment on the test cases.
Merged to master/3.0.

dongjoon-hyun pushed a commit that referenced this pull request Aug 5, 2020
… status change

Closes #29350 from yanxiaole/SPARK-32529.

Authored-by: Yan Xiaole <xiaole.yan@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c1d17df)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

@yanxiaole . You are added to the Apache Spark contributor group and SPARK-32529 is assigned to you. Welcome.

@HeartSaVioR
Contributor

@yanxiaole

If this is the idea we are going to apply, I can submit a PR about it later. Probably start from a general exception and log it? What do you think? @HeartSaVioR

Yeah, I think it's OK. My rough idea is to wrap the whole lambda passed to filter in a try-catch, and to log and swallow NonFatal exceptions.
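
The idea sketched in this comment could look roughly like the following. This is an illustration only, not the code from the follow-up PR: `PerEntryGuardSketch`, `guarded` and `demo` are hypothetical names, and the log callback stands in for whatever logging the history server actually uses.

```scala
import scala.util.control.NonFatal

object PerEntryGuardSketch {
  // Hypothetical helper: wraps a per-entry predicate so that a non-fatal
  // failure is logged and the entry is skipped, instead of aborting the filter.
  def guarded[A](log: String => Unit)(p: A => Boolean): A => Boolean = { entry =>
    try p(entry)
    catch {
      case NonFatal(e) =>
        log(s"Error while processing $entry: ${e.getMessage}")
        false
    }
  }

  def demo(): (Seq[String], Int) = {
    var logged = 0
    val entries = Seq("app_1", "app_2.inprogress", "app_3")
    // Simulated per-entry check that fails for a path that has gone stale.
    val shouldLoad: String => Boolean = { name =>
      if (name.endsWith(".inprogress")) throw new java.io.FileNotFoundException(name)
      true
    }
    val kept = entries.filter(guarded[String](_ => logged += 1)(shouldLoad))
    (kept, logged)
  }
}
```

`NonFatal` deliberately does not match errors such as `OutOfMemoryError` or `InterruptedException`, so those would still propagate; only ordinary per-entry failures are swallowed.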

@yanxiaole
Contributor Author

Thank you, @HeartSaVioR and @dongjoon-hyun .
I have submitted a new PR #29374 to address the general NonFatal exception.

dongjoon-hyun pushed a commit that referenced this pull request Aug 9, 2020
… History server

### What changes were proposed in this pull request?
This PR wraps the History server scan logic in a try/catch so that exceptions are logged and swallowed per entry.

### Why are the changes needed?
As discussed in #29350, a failure in one entry shouldn't affect the others.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually tested.

Closes #29374 from yanxiaole/SPARK-32557.

Authored-by: Yan Xiaole <xiaole.yan@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
HeartSaVioR pushed a commit to HeartSaVioR/spark that referenced this pull request Oct 15, 2020
… History server

HeartSaVioR pushed a commit that referenced this pull request Oct 19, 2020
… History server

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
… History server

@Lobo2008

Hi, I'm using the latest Spark 3.3.0 but still get the exception java.io.FileNotFoundException: File does not exist: /LOG_DIR/application_1657344020931_1400038_1.inprogress.

@dongjoon-hyun
Member

Hi, @Lobo2008 . Could you file a new JIRA issue with the stack trace or logs?

@Lobo2008

Hi, @Lobo2008 . Could you file a new JIRA issue with the stack trace or logs?

Hi @dongjoon-hyun, sorry, I misunderstood this PR. What I ran into is that some event logs never show up in SHS.

  • In HDFS: hadoop fs -ls hdfs://xxxx/eventLog/* | grep -v 'inprogress' | wc -l returns 57307
  • In SHS: "Showing 1 to 20 of 57,258 entries", so 49 apps are missing

These 49 event logs themselves seem to be fine, because when I

  • hadoop cp app_123 to app_123_2.inprogress
  • then hadoop mv app_123_2.inprogress to app_123_2
  • app_123 appears in SHS

So this PR skips these 49 apps rather than resolving them.
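
To find out which applications are missing, rather than only how many, the two listings can be compared as sets. A minimal sketch, assuming both name lists have already been exported by some other means; `MissingAppsSketch`, `completed` and `missing` are hypothetical names, and the suffix check is a simplification of the `grep -v 'inprogress'` filter above.

```scala
object MissingAppsSketch {
  // Completed logs: names that do not end in ".inprogress".
  def completed(listing: Seq[String]): Set[String] =
    listing.filterNot(_.endsWith(".inprogress")).toSet

  // Logs present in storage but absent from the history server's list.
  def missing(listing: Seq[String], shown: Set[String]): Set[String] =
    completed(listing) -- shown
}
```

With the counts reported above, the result would be expected to contain 57307 - 57258 = 49 names.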
