-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32529][CORE] Fix Historyserver log scan aborted by application status change #29350
Conversation
ok to test |
} catch { | ||
case _: FileNotFoundException => false | ||
} | ||
case _: FileNotFoundException => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd have empty new line after }
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
nit: I'd have empty new line after
}
.
Test build #127080 has finished for PR 29350 at commit
|
Test build #127079 has finished for PR 29350 at commit
|
retest this, please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's non-trivial to create a test for testing this behavior, so OK to go without test.
Probably we also want to log and swallow the exception per entry, so that exception from any entry would affect others. That's just an idea and doesn't necessarily to be applied in this PR.
Yes, I think it would be better. If this is the idea we are going to apply, I can submit a PR about it later. Probably start from a general exception and log it? What do you think? @HeartSaVioR |
Test build #127091 has finished for PR 29350 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your first contribution, @yanxiaole .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. I also agree with @HeartSaVioR 's assessment on the test cases.
Merged to master/3.0.
… status change # What changes were proposed in this pull request? This PR adds a `FileNotFoundException` try catch block while adding a new entry to history server application listing to skip the non-existing path. ### Why are the changes needed? If there are a large number (>100k) of applications log dir, listing the log dir will take a few seconds. After getting the path list some applications might have finished already, and the filename will change from `foo.inprogress` to `foo`. It leads to a problem when adding an entry to the listing, querying file status like `fileSizeForLastIndex` will throw out a `FileNotFoundException` exception if the application was finished. And the exception will abort current loop, in a busy cluster, it will make history server couldn't list and load any application log. ``` 20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event log updates java.io.FileNotFoundException: File does not exist: hdfs://xx/logs/spark/application_11111111111111.lz4.inprogress at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520) at org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 1. setup another script keeps changing the filename of applications under history log dir 2. launch the history server 3. check whether the `File does not exist` error log was gone. Closes #29350 from yanxiaole/SPARK-32529. Authored-by: Yan Xiaole <xiaole.yan@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c1d17df) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@yanxiaole . You are added to the Apache Spark contributor group and SPARK-32529 is assigned to you. Welcome. |
Yeah I think it's OK. My sketched idea is wrapping whole lambda function in filter with try-catch, and logging and swallowing |
Thank you, @HeartSaVioR and @dongjoon-hyun . |
… History server ### What changes were proposed in this pull request? This PR adds a try catch wrapping the History server scan logic to log and swallow the exception per entry. ### Why are the changes needed? As discussed in #29350 , one entry failure shouldn't affect others. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested. Closes #29374 from yanxiaole/SPARK-32557. Authored-by: Yan Xiaole <xiaole.yan@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… History server ### What changes were proposed in this pull request? This PR adds a try catch wrapping the History server scan logic to log and swallow the exception per entry. ### Why are the changes needed? As discussed in apache#29350 , one entry failure shouldn't affect others. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested. Closes apache#29374 from yanxiaole/SPARK-32557. Authored-by: Yan Xiaole <xiaole.yan@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… History server ### What changes were proposed in this pull request? This PR adds a try catch wrapping the History server scan logic to log and swallow the exception per entry. ### Why are the changes needed? As discussed in #29350 , one entry failure shouldn't affect others. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested. Closes #29374 from yanxiaole/SPARK-32557. Authored-by: Yan Xiaole <xiaole.yan@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… History server ### What changes were proposed in this pull request? This PR adds a try catch wrapping the History server scan logic to log and swallow the exception per entry. ### Why are the changes needed? As discussed in apache#29350 , one entry failure shouldn't affect others. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested. Closes apache#29374 from yanxiaole/SPARK-32557. Authored-by: Yan Xiaole <xiaole.yan@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Hi, I'm using the latest Spark 3.3.0 but still got the exception |
Hi, @Lobo2008 . Could you file a new JIRA issue with the stack trace or logs? |
Hi @dongjoon-hyun , Sorry that I misunderstood this PR. What I came across is that there are always some event logs cannot be seen from SHS.
These 49 event logs maybe ok because when I
This PR is skipping these 49 apps not resolving them. |
What changes were proposed in this pull request?
This PR adds a
FileNotFoundException
try catch block while adding a new entry to history server application listing to skip the non-existing path.Why are the changes needed?
If there are a large number (>100k) of applications log dir, listing the log dir will take a few seconds. After getting the path list some applications might have finished already, and the filename will change from
foo.inprogress
tofoo
.It leads to a problem when adding an entry to the listing, querying file status like
fileSizeForLastIndex
will throw out aFileNotFoundException
exception if the application was finished. And the exception will abort current loop, in a busy cluster, it will make history server couldn't list and load any application log.Does this PR introduce any user-facing change?
No
How was this patch tested?
File does not exist
error log was gone.