[SPARK-32529][CORE] Fix Historyserver log scan aborted by application status change #29350

yanxiaole · 2020-08-04T16:17:01Z

What changes were proposed in this pull request?

This PR adds a try/catch block for FileNotFoundException around the code that adds a new entry to the history server application listing, so that paths which no longer exist are skipped.

Why are the changes needed?

If the log dir holds a large number (>100k) of application logs, listing it takes a few seconds. By the time the path list is returned, some applications may already have finished, and their files will have been renamed from foo.inprogress to foo.

This causes a problem when adding an entry to the listing: querying file status, for example via fileSizeForLastIndex, throws a FileNotFoundException if the application has finished in the meantime. The exception aborts the current scan loop, so on a busy cluster the history server can fail to list and load any application log.

20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event log updates
 java.io.FileNotFoundException: File does not exist: hdfs://xx/logs/spark/application_11111111111111.lz4.inprogress
 at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527)
 at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520)
 at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520)
 at org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170)
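
To make the race concrete, here is a minimal sketch, not the actual FsHistoryProvider code. It assumes a local directory instead of HDFS; `ScanSketch`, `fileSize`, and `scan` are hypothetical names, and `fileSize` only stands in for a status query such as fileSizeForLastIndex.

```scala
import java.io.FileNotFoundException
import java.nio.file.{Files, NoSuchFileException, Path}

// Hypothetical sketch: none of these names exist in Spark.
object ScanSketch {
  // Stand-in for a file status query on a previously listed path.
  // Rethrows as FileNotFoundException to mirror what the HDFS client reports.
  def fileSize(path: Path): Long =
    try {
      Files.size(path)
    } catch {
      case _: NoSuchFileException =>
        throw new FileNotFoundException(s"File does not exist: $path")
    }

  // Keeps only the listed entries whose status can still be read.
  // An entry renamed after the listing is skipped instead of aborting the scan.
  def scan(listed: Seq[Path]): Seq[Path] =
    listed.filter { path =>
      try {
        fileSize(path) > 0
      } catch {
        case _: FileNotFoundException => false
      }
    }
}
```

Without the catch inside the filter, the first renamed entry would throw out of the whole `scan` call, which is the behavior shown in the stack trace above.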

Does this PR introduce any user-facing change?

No

How was this patch tested?

Set up a separate script that keeps renaming application files under the history log dir.
Launch the history server.
Check that the File does not exist error no longer appears in the log.
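
The renaming script itself is not included in this PR. As a rough idea of what such a script could look like, here is a sketch that assumes the log dir is a local directory (the real setup is presumably on HDFS); `RenameChurn`, `flip`, `flipAll`, and `churn` are hypothetical names.

```scala
import java.nio.file.{Files, Path}

// Hypothetical sketch of a rename churn helper; not the script used for this PR.
object RenameChurn {
  private val Suffix = ".inprogress"

  // Renames foo.inprogress to foo, or foo to foo.inprogress.
  def flip(path: Path): Path = {
    val name = path.getFileName.toString
    val target =
      if (name.endsWith(Suffix)) name.stripSuffix(Suffix) else name + Suffix
    Files.move(path, path.resolveSibling(target))
  }

  // Flips every regular file directly under dir and returns the new paths.
  def flipAll(dir: Path): Seq[Path] = {
    // listFiles() returns null if dir cannot be listed, hence the Option.
    val files = Option(dir.toFile.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    files.filter(_.isFile).map(f => flip(f.toPath))
  }

  // Repeats the flip a fixed number of times with a pause in between.
  def churn(dir: Path, rounds: Int, pauseMillis: Long): Unit =
    (1 to rounds).foreach { _ =>
      flipAll(dir)
      Thread.sleep(pauseMillis)
    }
}
```

Running `churn` against a test log dir while the history server scans it should make listed paths disappear between the listing and the status query.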


HeartSaVioR · 2020-08-05T05:10:56Z

ok to test

HeartSaVioR · 2020-08-05T05:13:38Z

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

+              } catch {
+                case _: FileNotFoundException => false
+              }
+            case _: FileNotFoundException =>


nit: I'd have empty new line after }.

added


SparkQA · 2020-08-05T07:05:01Z

Test build #127080 has finished for PR 29350 at commit d1bf4ca.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-05T07:05:02Z

Test build #127079 has finished for PR 29350 at commit d113709.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2020-08-05T08:02:31Z

retest this, please

HeartSaVioR

I think it's non-trivial to create a test for testing this behavior, so OK to go without test.

Probably we also want to log and swallow the exception per entry, so that an exception from any one entry would not affect the others. That's just an idea and doesn't necessarily need to be applied in this PR.

HeartSaVioR · 2020-08-05T08:50:21Z

cc. @vanzin @gengliangwang

yanxiaole · 2020-08-05T09:02:54Z

Probably we also want to log and swallow the exception per entry, so that an exception from any one entry would not affect the others.

Yes, I think it would be better.

If this is the idea we are going to apply, I can submit a PR about it later. Probably start from a general exception and log it? What do you think? @HeartSaVioR

SparkQA · 2020-08-05T10:28:44Z

Test build #127091 has finished for PR 29350 at commit d1bf4ca.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

Thank you for your first contribution, @yanxiaole .

dongjoon-hyun

+1, LGTM. I also agree with @HeartSaVioR 's assessment on the test cases.
Merged to master/3.0.

… status change

Closes #29350 from yanxiaole/SPARK-32529.

Authored-by: Yan Xiaole <xiaole.yan@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit c1d17df)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2020-08-05T18:02:49Z

@yanxiaole . You are added to the Apache Spark contributor group and SPARK-32529 is assigned to you. Welcome.

HeartSaVioR · 2020-08-06T00:48:41Z

@yanxiaole

If this is the idea we are going to apply, I can submit a PR about it later. Probably start from a general exception and log it? What do you think? @HeartSaVioR

Yeah, I think it's OK. My sketched idea is wrapping the whole lambda function passed to filter in a try-catch, and logging and swallowing NonFatal exceptions.
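
A minimal sketch of that idea, assuming nothing about the actual FsHistoryProvider code: `PerEntryScan` and `safeFilter` are hypothetical names, and the `log` callback stands in for whatever logging the provider uses.

```scala
import scala.util.control.NonFatal

// Hypothetical sketch: wraps the whole filter predicate in a try-catch.
object PerEntryScan {
  // Applies keep to each entry. If keep throws a non-fatal exception,
  // the failure is logged and that entry is dropped, while the
  // remaining entries are still processed.
  def safeFilter[A](entries: Seq[A])(keep: A => Boolean)(
      log: (A, Throwable) => Unit): Seq[A] =
    entries.filter { entry =>
      try {
        keep(entry)
      } catch {
        case NonFatal(e) =>
          log(entry, e)
          false
      }
    }
}
```

Matching on `NonFatal` rather than `Throwable` lets fatal errors such as OutOfMemoryError still propagate instead of being swallowed.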

yanxiaole · 2020-08-06T07:12:14Z

Thank you, @HeartSaVioR and @dongjoon-hyun .
I have submitted a new PR #29374 to address the general NonFatal exception.

… History server

### What changes were proposed in this pull request?

This PR adds a try catch wrapping the History server scan logic to log and swallow the exception per entry.

### Why are the changes needed?

As discussed in #29350, one entry failure shouldn't affect others.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested.

Closes #29374 from yanxiaole/SPARK-32557.

Authored-by: Yan Xiaole <xiaole.yan@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>


Lobo2008 · 2022-08-25T10:14:40Z

Hi, I'm using the latest Spark 3.3.0 but still get the exception java.io.FileNotFoundException: File does not exist: /LOG_DIR/application_1657344020931_1400038_1.inprogress.

dongjoon-hyun · 2022-08-25T23:40:12Z

Hi, @Lobo2008 . Could you file a new JIRA issue with the stack trace or logs?

Lobo2008 · 2022-08-29T13:32:34Z


Hi, @Lobo2008 . Could you file a new JIRA issue with the stack trace or logs?

Hi @dongjoon-hyun, sorry, I misunderstood this PR. What I ran into is that some event logs never show up in SHS.

In HDFS: hadoop fs -ls hdfs://xxxx/eventLog/* |grep -v 'inprogress'|wc -l returns 57307
In SHS: Showing 1 to 20 of 57,258 entries, so 49 apps are missing

These 49 event logs are probably fine, because when I

hadoop cp app_123 to app_123_2.inprogress
then hadoop mv app_123_2.inprogress to app_123_2
app_123 appears in SHS

So this PR skips these 49 apps rather than resolving them.

[SPARK-32529][CORE] Fix Historyserver log scan aborted by application status change.

d3f7a6c

probot-autolabeler bot added the CORE label Aug 4, 2020

[SPARK-32529][CORE] add more catch blocks

d113709

HeartSaVioR reviewed Aug 5, 2020

[SPARK-32529][CORE] add an empty new line after }

d1bf4ca

HeartSaVioR approved these changes Aug 5, 2020

dongjoon-hyun reviewed Aug 5, 2020

dongjoon-hyun approved these changes Aug 5, 2020

dongjoon-hyun closed this in c1d17df Aug 5, 2020

yanxiaole deleted the SPARK-32529 branch August 6, 2020 04:49

yanxiaole mentioned this pull request Aug 6, 2020

[SPARK-32557][CORE] Logging and swallowing the exception per entry in History server #29374

Closed
