New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6879][HistoryServer]check if app is completed before clean it up #5491
Conversation
CC @vanzin @viper-kun I think this makes sense, although it does change the logic slightly. Now, log cleanup happens based on the application's state, rather than the modification time of the log files. This could be a good thing, and not deleting incomplete apps sounds good too. |
Fixed the wrong path to delete and I have tested on my cluster, it worked fine. |
Test build #30159 has finished for PR 5491 at commit
|
Test build #30166 has finished for PR 5491 at commit
|
Question: did you actually run into this problem or is this theoretical? Because I'd expect a live application to be updating the event log, in which case its modification time would be changing, and it would then never be caught by the log cleaner. If the log is not being updated there's a pretty good chance the application is dead, and since we don't have shutdown hooks to stop the SparkContext, the "inprogress" files would be left around indefinitely with this change. |
@vanzin as a counterpoint -- I can easily imagine sparks applications that running in "jobserver" mode that sit idle for a long time between active jobs. Can we just fix the problem with "inprogress" files from dead apps separately? |
I can see that, but I wonder it a different approach, where the app itself would explicitly keep the log's mod time updated (even if not writing anything to the logs), would be worth it in that case. |
@vanzin As what you said |
Okay I made an observation on my cluster, the thrift server is started at 21:01:32 and it hadn't do anything from that. Its evnet log's modification time is 21:01 too(while over 12 hours passed).
|
All I mean here is that UPDATE: just tested the above and it seems to work on my version of HDFS. It adds some more logic to the event logger, but at least it prevents misbehaving applications from forever polluting the history server listing. |
@vanzin Erhhh, It seems like another solution, but there are few questions: Think my solution is better except one issue left: |
Yeah, I'm a little on the fence. Pressure on HDFS shouldn't be a problem - you don't need to call On the other hand, users should fix their apps. Anyway, I'll review the current patch as is. It was just an idea to try to still clean up these broken logs. |
fs.delete(new Path(logDir + "/" + info.logPath), true) | ||
} catch { | ||
case t: IOException => logError(s"IOException in cleaning logs of ${info.logPath}", t) | ||
appsToRetain += (info.id -> info) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this line be indented? And the logError
above me on its own line, at the same indent level as this one?
If feels a little weird to leave the app on the live list; in the case of old-style logs, a few files may have been deleted, while others remain, so the app would be "un-renderable". On the other hand, this means the code will try to clean up this app again later on, so I guess it's better this way.
I think it is ok. User must call sc.stop(), if not, it just not delete some event log. |
Used an extra Map( |
Test build #30218 has finished for PR 5491 at commit
|
case t: IOException => logError(s"IOException in cleaning logs of $dir", t) | ||
case t: IOException => | ||
logError(s"IOException in cleaning logs of ${info.logPath}", t) | ||
applications += (info.id -> info) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you don't want to modify applications
like this because then the code is not thread-safe. The only modification you can make to applications
is to set it to a different map.
Also, by doing this, the app would be inserted in the map in, potentially, the wrong order.
@vanzin Now I use an extra global ListBuffer to store the apps to clean. Update its content and delete its dirs/files in every clean round. I know the elements in this ListBuffer could be type of |
Test build #30312 has finished for PR 5491 at commit
|
Test build #30314 has finished for PR 5491 at commit
|
@@ -21,6 +21,7 @@ import java.io.{IOException, BufferedInputStream, FileNotFoundException, InputSt | |||
import java.util.concurrent.{ExecutorService, Executors, TimeUnit} | |||
|
|||
import scala.collection.mutable | |||
import scala.collection.mutable.ListBuffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you don't need to import this explicitly; use mutable.ListBuffer
// if path is a directory and set to true, | ||
// the directory is deleted else throws an exception | ||
fs.delete(dir.getPath, true) | ||
val path = new Path(logDir + "/" + info.logPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: new Path(logDir, info.logPath)
Test build #30390 has finished for PR 5491 at commit
|
@vanzin Added another temporary ListBuffer |
Test build #30571 has finished for PR 5491 at commit
|
// Only directories older than the specified max age will be deleted | ||
statusList.foreach { dir => | ||
val leftToClean = new mutable.ListBuffer[FsApplicationHistoryInfo] | ||
appsToClean.foreach { info => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this would probably be more efficient:
while (appsToClean.nonEmpty) {
val info = appsToClean.last
try {
...
appsToClean.remove(appsToClean.size - 1)
} catch {
...
}
}
But probably not a big deal in this context.
One thing to note is that if someone adds logs with the wrong permissions, this code will never be able to delete them, so those logs will forever be in the appsToClean
list. It might be worth it to treat AccessControlException
especially here and just give up trying to clean up logs with the wrong permissions.
LGTM, just left a minor comment. |
If we changed the way to iterate like you said, the delete operations may cost too much time here (even be stuck) in case DFS client throw IOException often (though with a very low possibility). So I'd keep it like now. Treating ACE specially is good and I did it like in |
Test build #30623 has finished for PR 5491 at commit
|
@vanzin Please take a look, thanks~ |
LGTM |
…t up https://issues.apache.org/jira/browse/SPARK-6879 Use `applications` to replace `FileStatus`, and check if the app is completed before clean it up. If an exception was throwed, add it to `applications` to wait for the next loop. Author: WangTaoTheTonic <wangtao111@huawei.com> Closes apache#5491 from WangTaoTheTonic/SPARK-6879 and squashes the following commits: 4a533eb [WangTaoTheTonic] treat ACE specially cb45105 [WangTaoTheTonic] rebase d4d5251 [WangTaoTheTonic] per Marcelo's comments d7455d8 [WangTaoTheTonic] slightly change when delete file b0abca5 [WangTaoTheTonic] use global var to store apps to clean 94adfe1 [WangTaoTheTonic] leave expired apps alone to be deleted 9872a9d [WangTaoTheTonic] use the right path fdef4d6 [WangTaoTheTonic] check if app is completed before clean it up
…t up https://issues.apache.org/jira/browse/SPARK-6879 Use `applications` to replace `FileStatus`, and check if the app is completed before clean it up. If an exception was throwed, add it to `applications` to wait for the next loop. Author: WangTaoTheTonic <wangtao111@huawei.com> Closes apache#5491 from WangTaoTheTonic/SPARK-6879 and squashes the following commits: 4a533eb [WangTaoTheTonic] treat ACE specially cb45105 [WangTaoTheTonic] rebase d4d5251 [WangTaoTheTonic] per Marcelo's comments d7455d8 [WangTaoTheTonic] slightly change when delete file b0abca5 [WangTaoTheTonic] use global var to store apps to clean 94adfe1 [WangTaoTheTonic] leave expired apps alone to be deleted 9872a9d [WangTaoTheTonic] use the right path fdef4d6 [WangTaoTheTonic] check if app is completed before clean it up
https://issues.apache.org/jira/browse/SPARK-6879
Use
applications
to replaceFileStatus
, and check if the app is completed before clean it up.If an exception was throwed, add it to
applications
to wait for the next loop.