
[SPARK-6879][HistoryServer]check if app is completed before clean it up #5491

Closed
wants to merge 8 commits

Conversation

WangTaoTheTonic
Contributor

https://issues.apache.org/jira/browse/SPARK-6879

Use `applications` to replace `FileStatus`, and check whether the app is completed before cleaning it up.
If an exception is thrown, add the app back to `applications` to wait for the next loop.
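The proposed rule can be sketched as follows (a minimal sketch with illustrative names — `AppInfo`, `retained` — not the actual FsHistoryProvider code): an app is kept if it is still incomplete, or if its last update is within the cleaner's max age.

```scala
// Illustrative sketch of the cleanup rule this PR proposes: only applications
// that are completed AND older than the max age are eligible for deletion;
// incomplete apps are always retained.
case class AppInfo(id: String, logPath: String, completed: Boolean, lastUpdatedMs: Long)

def retained(apps: Map[String, AppInfo], nowMs: Long, maxAgeMs: Long): Map[String, AppInfo] =
  apps.filter { case (_, info) =>
    !info.completed || (nowMs - info.lastUpdatedMs) < maxAgeMs
  }

val now = 1000000L
val apps = Map(
  "a" -> AppInfo("a", "a.log", completed = true, lastUpdatedMs = now - 1000),         // expired, completed: delete
  "b" -> AppInfo("b", "b.inprogress", completed = false, lastUpdatedMs = now - 1000), // incomplete: keep
  "c" -> AppInfo("c", "c.log", completed = true, lastUpdatedMs = now - 100)           // fresh: keep
)
val kept = retained(apps, now, maxAgeMs = 500L)
// kept.keySet == Set("b", "c")
```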

@srowen
Member

srowen commented Apr 13, 2015

CC @vanzin @viper-kun I think this makes sense, although it does change the logic slightly. Now, log cleanup happens based on the application's state, rather than the modification time of the log files. This could be a good thing, and not deleting incomplete apps sounds good too.

@WangTaoTheTonic
Contributor Author

Fixed the wrong deletion path. I have tested this on my cluster and it works fine.

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30159 has finished for PR 5491 at commit fdef4d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30166 has finished for PR 5491 at commit 9872a9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@vanzin
Contributor

vanzin commented Apr 13, 2015

Question: did you actually run into this problem or is this theoretical?

Because I'd expect a live application to be updating the event log, in which case its modification time would be changing, and it would then never be caught by the log cleaner.

If the log is not being updated there's a pretty good chance the application is dead, and since we don't have shutdown hooks to stop the SparkContext, the "inprogress" files would be left around indefinitely with this change.

@squito
Contributor

squito commented Apr 13, 2015

@vanzin as a counterpoint -- I can easily imagine Spark applications running in "jobserver" mode that sit idle for a long time between active jobs.

Can we just fix the problem with "inprogress" files from dead apps separately?

@vanzin
Contributor

vanzin commented Apr 13, 2015

I can easily imagine Spark applications running in "jobserver" mode that sit idle for a long time between active jobs.

I can see that, but I wonder if a different approach, where the app itself would explicitly keep the log's mod time updated (even if not writing anything to the logs), would be worth it in that case.

@WangTaoTheTonic
Contributor Author

@vanzin
It is not just theoretical. I tested with a ThriftServer instance; before this patch, its event log was deleted by the cleaner (when it expired).

As for what you said about "its modification time would be changing" or "the app itself would explicitly keep the log's mod time updated (even if not writing anything to the logs)", I'm afraid I'm not sure it behaves like this. Anyway, I will test it.

@WangTaoTheTonic
Contributor Author

Okay, I made an observation on my cluster: the thrift server was started at 21:01:32 and it hasn't done anything since then. Its event log's modification time is still 21:01 (while over 12 hours have passed).

15/04/13 21:01:10 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/04/13 21:01:10 INFO Client: Application report for application_1428917164774_0022 (state: RUNNING)
15/04/13 21:01:10 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: doggie153
ApplicationMaster RPC port: 0
queue: default
start time: 1428930066890
final status: UNDEFINED
tracking URL: http://doggie153:8088/proxy/application_1428917164774_0022/
user: root
15/04/13 21:01:10 INFO YarnClientSchedulerBackend: Application application_1428917164774_0022 has started running.
15/04/13 21:01:11 INFO NettyBlockTransferService: Server created on 46518
..........
15/04/13 21:01:32 INFO ObjectStore: ObjectStore, initialize called
15/04/13 21:01:32 INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".
15/04/13 21:01:32 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
15/04/13 21:01:32 INFO ObjectStore: Initialized ObjectStore
15/04/13 21:01:32 INFO AbstractService: Service:ThriftBinaryCLIService is started.
15/04/13 21:01:32 INFO AbstractService: Service:HiveServer2 is started.
15/04/13 21:01:32 INFO HiveThriftServer2: HiveThriftServer2 started
15/04/13 21:01:32 INFO ThriftCLIService: ThriftBinaryCLIService listening on 0.0.0.0/0.0.0.0:10000

doggie157:/opt/oss/hadoop-2.6.0/bin # ./hadoop fs -ls /sparkhistory
15/04/14 09:09:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rwxrwx--- 3 root supergroup 241 2015-04-13 21:01 /sparkhistory/application_1428917164774_0022.inprogress
doggie157:/opt/oss/hadoop-2.6.0/bin # date
Tue Apr 14 09:12:37 CST 2015
doggie157:/opt/oss/hadoop-2.6.0/bin #

@vanzin
Contributor

vanzin commented Apr 14, 2015

where the app itself would explicitly keep the log's mod time updated

All I mean here is that EventLoggingListener could from time to time just call FileSystem.setTimes() on the underlying log to make it look like it's up-to-date. That way the HS won't clean up files if the application is really alive. I'm not sure, though, whether HDFS allows you to do that while you have an open file handle.

UPDATE: just tested the above and it seems to work on my version of HDFS. It adds some more logic to the event logger, but at least it prevents misbehaving applications from forever polluting the history server listing.
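The "keep the mod time fresh" idea can be sketched as a small predicate (a hedged sketch: the half-max-age threshold and the `needsTouch` name are illustrative choices, not anything from this PR; in the real listener a positive result would trigger something like Hadoop's `FileSystem.setTimes(logPath, System.currentTimeMillis(), -1)`, where `-1` leaves the access time unchanged):

```scala
// Decide whether the event log's mod time should be refreshed. Touching the
// file well before the cleaner's max age elapses keeps a live-but-idle app
// from being cleaned up, without writing on every check.
def needsTouch(lastModMs: Long, nowMs: Long, cleanerMaxAgeMs: Long): Boolean =
  nowMs - lastModMs > cleanerMaxAgeMs / 2

val shouldTouch = needsTouch(lastModMs = 0L, nowMs = 600L, cleanerMaxAgeMs = 1000L) // true
val tooSoon = needsTouch(lastModMs = 0L, nowMs = 400L, cleanerMaxAgeMs = 1000L)     // false
```

Refreshing only past the halfway point keeps the number of `setTimes` calls (and hence namenode load) low while still guaranteeing the cleaner never sees a stale-looking live log.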

@WangTaoTheTonic
Contributor Author

@vanzin Hmm, it seems like another solution, but there are a few questions:
1. It adds logic to the event logger (more code and more actions).
2. It increases the pressure on HDFS; I'm not sure it is namenode-friendly.
3. How should we set the frequency of calling setTimes?

I think my solution is better, except for one remaining issue:
if users don't call sc.stop() in their app, its event log will never be deleted. I am not sure it is our responsibility to take care of this.

@vanzin
Contributor

vanzin commented Apr 14, 2015

Yeah, I'm a little on the fence. Pressure on HDFS shouldn't be a problem - you don't need to call setTimes that often, just often enough that the HS wouldn't clean up the log.

On the other hand, users should fix their apps.

Anyway, I'll review the current patch as is. It was just an idea to try to still clean up these broken logs.

fs.delete(new Path(logDir + "/" + info.logPath), true)
} catch {
case t: IOException => logError(s"IOException in cleaning logs of ${info.logPath}", t)
appsToRetain += (info.id -> info)
Contributor

Shouldn't this line be indented? And the logError above it on its own line, at the same indent level as this one?

It feels a little weird to leave the app on the live list; in the case of old-style logs, a few files may have been deleted while others remain, so the app would be "un-renderable". On the other hand, this means the code will try to clean up this app again later, so I guess it's better this way.

@viper-kun
Contributor

I think it is OK. Users must call sc.stop(); if they don't, some event logs just won't be deleted.

@WangTaoTheTonic
Contributor Author

Used an extra map (appsToClean) to store applications that need to be deleted, and add them back to `applications` if deletion fails.
@vanzin Please check whether I understood correctly. Thanks.

@SparkQA

SparkQA commented Apr 14, 2015

Test build #30218 has finished for PR 5491 at commit 94adfe1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

case t: IOException => logError(s"IOException in cleaning logs of $dir", t)
case t: IOException =>
logError(s"IOException in cleaning logs of ${info.logPath}", t)
applications += (info.id -> info)
Contributor

No, you don't want to modify applications like this because then the code is not thread-safe. The only modification you can make to applications is to set it to a different map.

Also, by doing this, the app would be inserted in the map in, potentially, the wrong order.
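The thread-safety point can be sketched like this (illustrative types; the real field holds FsApplicationHistoryInfo values): readers see either the old map or the new map, never a half-mutated one.

```scala
// Replace the reference to an immutable map instead of mutating a shared
// mutable one: the assignment is an atomic reference swap, so concurrent
// readers always observe a consistent snapshot.
object Provider {
  @volatile var applications: Map[String, String] = Map("app-1" -> "log-1")

  def retryLater(id: String, logPath: String): Unit = {
    applications = applications + (id -> logPath) // builds a new map, then swaps
  }
}

Provider.retryLater("app-2", "log-2")
// Provider.applications.keySet == Set("app-1", "app-2")
```

The second point — ordering — is why a plain `+` as above wouldn't be enough either: the real code rebuilds the whole map in the desired listing order rather than appending an entry wherever it lands.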

@WangTaoTheTonic
Contributor Author

@vanzin Now I use an extra global ListBuffer to store the apps to clean, updating its contents and deleting their dirs/files in every clean round.

I know the elements in this ListBuffer could be of type Path or String to occupy less space, but for simpler logic I just leave them as FsApplicationHistoryInfo.
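A minimal sketch of the round-based approach described above (illustrative names — `App`, `cleanRound` — the real elements are FsApplicationHistoryInfo): failures are collected into a fresh buffer instead of mutating appsToClean mid-iteration, and that buffer becomes the next round's work list.

```scala
import java.io.IOException
import scala.collection.mutable

case class App(id: String, logPath: String)

// One clean round: attempt each deletion; collect apps whose deletion threw
// an IOException so they can be retried in the next round.
def cleanRound(appsToClean: mutable.ListBuffer[App], delete: App => Unit): mutable.ListBuffer[App] = {
  val leftToClean = new mutable.ListBuffer[App]
  appsToClean.foreach { app =>
    try {
      delete(app)
    } catch {
      case _: IOException => leftToClean += app // carry over to the next round
    }
  }
  leftToClean
}

val pending = mutable.ListBuffer(App("a", "a.log"), App("b", "b.log"))
val left = cleanRound(pending, app => if (app.id == "b") throw new IOException("transient"))
// left.map(_.id) == List("b")
```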

@SparkQA

SparkQA commented Apr 15, 2015

Test build #30312 has finished for PR 5491 at commit b0abca5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 15, 2015

Test build #30314 has finished for PR 5491 at commit d7455d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@@ -21,6 +21,7 @@ import java.io.{IOException, BufferedInputStream, FileNotFoundException, InputSt
import java.util.concurrent.{ExecutorService, Executors, TimeUnit}

import scala.collection.mutable
import scala.collection.mutable.ListBuffer
Contributor

nit: you don't need to import this explicitly; use mutable.ListBuffer

// if path is a directory and set to true,
// the directory is deleted else throws an exception
fs.delete(dir.getPath, true)
val path = new Path(logDir + "/" + info.logPath)
Contributor

nit: new Path(logDir, info.logPath)

@SparkQA

SparkQA commented Apr 16, 2015

Test build #30390 has finished for PR 5491 at commit d4d5251.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@WangTaoTheTonic
Contributor Author

@vanzin Added another temporary ListBuffer, leftToClean, to store the apps that weren't deleted successfully, and to avoid editing appsToClean while iterating over it.

@SparkQA

SparkQA commented Apr 20, 2015

Test build #30571 has finished for PR 5491 at commit cb45105.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

// Only directories older than the specified max age will be deleted
statusList.foreach { dir =>
val leftToClean = new mutable.ListBuffer[FsApplicationHistoryInfo]
appsToClean.foreach { info =>
Contributor

Something like this would probably be more efficient:

while (appsToClean.nonEmpty) {
  val info = appsToClean.last
  try {
    ...
    appsToClean.remove(appsToClean.size - 1)
  } catch {
    ...
  }
}

But probably not a big deal in this context.

One thing to note is that if someone adds logs with the wrong permissions, this code will never be able to delete them, so those logs will forever be in the appsToClean list. It might be worth it to treat AccessControlException especially here and just give up trying to clean up logs with the wrong permissions.
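The "treat ACE specially" suggestion in the last paragraph might look like this (a sketch; it uses `java.security.AccessControlException` as a stand-in for Hadoop's `org.apache.hadoop.security.AccessControlException` — note the Hadoop one actually extends IOException, so in the real code the ACE case must be matched first):

```scala
import java.io.IOException
import java.security.AccessControlException

// Decide whether a failed deletion should stay in appsToClean. Permission
// errors will never succeed on retry, so give up on those; other I/O errors
// are assumed transient and retried in the next round.
def shouldRetry(t: Throwable): Boolean = t match {
  case _: AccessControlException => false // wrong permissions: drop for good
  case _: IOException            => true  // transient: keep for the next round
  case _                         => false
}
```

This keeps logs with the wrong permissions from sitting in the appsToClean list forever.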

@vanzin
Contributor

vanzin commented Apr 20, 2015

LGTM, just left a minor comment.

@WangTaoTheTonic
Contributor Author

If we changed the iteration the way you suggested, the delete operations could cost too much time here (or even get stuck) in case the DFS client throws IOException often (though with a very low probability). So I'd keep it as it is now.

Treating ACE specially is good, and I did it as in checkForLogs.

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30623 has finished for PR 5491 at commit 4a533eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@WangTaoTheTonic
Contributor Author

@vanzin Please take a look, thanks~

@vanzin
Contributor

vanzin commented Apr 23, 2015

LGTM

@asfgit asfgit closed this in baa83a9 Apr 23, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
[SPARK-6879][HistoryServer]check if app is completed before clean it up

https://issues.apache.org/jira/browse/SPARK-6879

Use `applications` to replace `FileStatus`, and check whether the app is completed before cleaning it up.
If an exception is thrown, add the app back to `applications` to wait for the next loop.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes apache#5491 from WangTaoTheTonic/SPARK-6879 and squashes the following commits:

4a533eb [WangTaoTheTonic] treat ACE specially
cb45105 [WangTaoTheTonic] rebase
d4d5251 [WangTaoTheTonic] per Marcelo's comments
d7455d8 [WangTaoTheTonic] slightly change when delete file
b0abca5 [WangTaoTheTonic] use global var to store apps to clean
94adfe1 [WangTaoTheTonic] leave expired apps alone to be deleted
9872a9d [WangTaoTheTonic] use the right path
fdef4d6 [WangTaoTheTonic] check if app is completed before clean it up
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015