
Conversation

@XuTingjun
Contributor

Every time an executor fetches a jar from the HTTP server, a lock file and a cache file are created on the local disk. After the fetch, these two files are no longer needed.
And when the jar is big, the cache file is just as big, so it wastes disk space.
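
A rough sketch of the fetch path in question (paraphrased, not the actual Utils.fetchFile source; only the cache file name matches the snippet quoted later in this thread, while the lock file name and the copy helpers are assumptions for illustration):

    import java.io.{File, RandomAccessFile}
    import java.net.URL
    import java.nio.file.{Files, StandardCopyOption}

    object FetchSketch {
      // Paraphrased sketch of the cached-fetch path. Only the cache file
      // naming is confirmed by this thread; everything else is assumed.
      def fetchWithCache(url: String, timestamp: Long, localDir: File, targetFile: File): Unit = {
        val cachedFileName = s"${url.hashCode}${timestamp}_cache"
        val lockFileName = s"${url.hashCode}${timestamp}_lock" // assumed analogous naming
        val lockFile = new File(localDir, lockFileName)
        val cachedFile = new File(localDir, cachedFileName)
        val raf = new RandomAccessFile(lockFile, "rw")
        val lock = raf.getChannel.lock() // serialize concurrent fetches on one node
        try {
          if (!cachedFile.exists()) { // cache miss: download once
            Files.copy(new URL(url).openStream(), cachedFile.toPath)
          }
        } finally {
          lock.release()
          raf.close()
        }
        // Copy out of the cache into the target; the cache and lock files
        // stay behind on disk, which is the waste described above.
        Files.copy(cachedFile.toPath, targetFile.toPath, StandardCopyOption.REPLACE_EXISTING)
      }
    }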

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Feb 12, 2015

Isn't the point that the files should stick around for future callers? The file is not recopied and lock is not recreated if it exists. (You would need a JIRA for this anyway, but first let's clear up this question.)

@XuTingjun XuTingjun changed the title from "[Core][Improvement] Delelte no longer used file" to "[SPARK-5764] Delete the cache and lock file after executor fetching the jar" on Feb 12, 2015
@XuTingjun
Contributor Author

val cachedFileName = s"${url.hashCode}${timestamp}_cache"

The cache file is named with url.hashCode and the timestamp, so no other cache file for a jar will ever share that name. It will never be reused by a future caller.

@srowen
Member

srowen commented Feb 12, 2015

The idea is that this uniquely determines the file and even a version of that file. That by itself is sound. Timestamp is not always "the current time". Look at the invocation in Executor.scala. I'm not as sure about the invocation in SparkContext.scala since it also does a fetch locally, with the current time, and that is always a 'cache miss', but I think that one is by design? But for the executor it looks correct at first glance since it uses timestamp as a sort of version key, where the timestamp is the time this particular file was added by the driver.
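
To make the "version key" idea concrete, here is a small self-contained sketch of that bookkeeping (a hypothetical stand-in for the executor's dependency-update logic, not the actual Spark code):

    import scala.collection.mutable

    object VersionKeySketch {
      // Hypothetical stand-in: a jar is refetched only when the driver-side
      // timestamp (a version key) moves forward, which is what makes
      // (url.hashCode, timestamp) a usable cache key across fetches.
      val currentJars = mutable.HashMap[String, Long]()

      def jarsToFetch(newJars: Map[String, Long]): Seq[String] =
        newJars.toSeq.collect {
          case (name, timestamp) if currentJars.getOrElse(name, -1L) < timestamp =>
            currentJars(name) = timestamp
            name // the real code would call Utils.fetchFile here with the cache enabled
        }

      def main(args: Array[String]): Unit = {
        println(jarsToFetch(Map("app.jar" -> 100L))) // List(app.jar): first sight, fetch
        println(jarsToFetch(Map("app.jar" -> 100L))) // List(): same version, nothing to do
        println(jarsToFetch(Map("app.jar" -> 200L))) // List(app.jar): driver re-added the jar
      }
    }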

@XuTingjun
Contributor Author

In SparkContext.scala, useCache is false, so it won't use the cached file.
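
A tiny hypothetical demo of why a cache would not help on that path anyway, assuming the naming scheme quoted above: the driver-side fetch uses the current time as the timestamp, so the computed cache name differs on every call.

    object DriverMissSketch {
      def main(args: Array[String]): Unit = {
        val url = "http://driver:33333/jars/app.jar" // assumed example URL
        val t1 = System.currentTimeMillis()
        Thread.sleep(2) // ensure a later "current time"
        val t2 = System.currentTimeMillis()
        // Two driver-side fetches of the same jar get different timestamps,
        // hence different cache file names: always a cache miss.
        assert(s"${url.hashCode}${t1}_cache" != s"${url.hashCode}${t2}_cache")
      }
    }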

@srowen
Member

srowen commented Feb 12, 2015

Ah right of course. So, the executor is keying the cache on (hash of) URL and 'version', where version is the driver's timestamp. That would be the same for executors across the same app, and that's the purpose of this cache. Right?
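
Assuming that is the intent, a quick hypothetical demo of the shared key: every executor of one app sees the same (url, timestamp) pair for a given jar, so they all compute the same cache file name.

    object SharedKeySketch {
      def cacheName(url: String, timestamp: Long): String =
        s"${url.hashCode}${timestamp}_cache"

      def main(args: Array[String]): Unit = {
        val url = "http://driver:33333/jars/app.jar" // assumed example URL
        val addedAt = 1423700000000L // the driver's add-time, shared app-wide
        // Two executors of the same app compute the same name, so the second
        // fetch on a node is a cache hit against the first one's download.
        assert(cacheName(url, addedAt) == cacheName(url, addedAt))
      }
    }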

@XuTingjun
Contributor Author

Do you mean that the executors on the same node will use the cached file? I think that's right.

@srowen
Member

srowen commented Feb 12, 2015

That looks like the intent, from the comment. These files should ultimately be deleted when the executor stops. Do you think there is a problem in light of this?

@XuTingjun
Contributor Author

I think the cache file should be deleted when the app finishes, not when the executor stops.

@srowen
Member

srowen commented Feb 12, 2015

Executors are per-app, so this is roughly the same thing?

@XuTingjun
Contributor Author

I think we should consider dynamic executor allocation, right?

@XuTingjun XuTingjun closed this Feb 12, 2015
@srowen
Member

srowen commented Feb 12, 2015

Yeah, good point. Actually, ignore my comment. The executors stick this file in SparkFiles.getRootDirectory and that is not necessarily deleted by the executor. I mean, it's not necessarily even shared.

My point was that they should not be immediately deleted, at least. They do serve a purpose in some cases.

@XuTingjun XuTingjun deleted the patch branch February 17, 2015 01:34