[SPARK-19705][SQL] Preferred location supporting HDFS cache for FileS… #17035

Closed


tanejagagan commented Feb 23, 2017

…canRDD

What changes were proposed in this pull request?

Added support for the HDFS cache using TaskLocation.inMemoryLocationTag.
NewHadoopRDD and HadoopRDD both already support the HDFS cache using TaskLocation.inMemoryLocationTag, where "hdfs_cache_" is added as a prefix to the hostname, which is then interpreted by the scheduler.
With this enhancement, the same tag ("hdfs_cache_") is added to the hostname if a FilePartition contains only a single file and that file is cached on one or more hosts.
The current implementation does not handle the case where a FilePartition has multiple files, as the preferredLocation calculation is more complex there.
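
To make the tagging scheme described above concrete, here is a minimal, self-contained sketch. It does not use Spark's internal classes: `LocationTagSketch`, `FileHosts`, `tagCached`, `parse`, and `taggedLocations` are hypothetical names introduced only for illustration, and the only value taken from this pull request is the "hdfs_cache_" prefix. The real scheduler logic may differ.

```scala
object LocationTagSketch {
  // Local stand-in for TaskLocation.inMemoryLocationTag (value quoted in the PR description).
  val inMemoryLocationTag = "hdfs_cache_"

  // Hypothetical, simplified view of one file: all hosts holding its blocks,
  // and the subset of hosts that hold it in the HDFS cache.
  final case class FileHosts(locations: Seq[String], cachedLocations: Seq[String])

  // Mark a host as holding the file in the HDFS cache.
  def tagCached(host: String): String = inMemoryLocationTag + host

  // Recover the plain hostname and whether the location carried the cache tag.
  def parse(location: String): (String, Boolean) =
    if (location.startsWith(inMemoryLocationTag))
      (location.stripPrefix(inMemoryLocationTag), true)
    else
      (location, false)

  // Only a partition with exactly one file, cached on at least one host, gets tagged
  // locations. The multi-file case is out of scope here, as in the PR description.
  def taggedLocations(files: Seq[FileHosts]): Option[Seq[String]] = files match {
    case Seq(only) if only.cachedLocations.nonEmpty =>
      Some(only.cachedLocations.map(tagCached))
    case _ => None
  }
}
```

Under these assumptions, a single file cached on host1 yields the location string "hdfs_cache_host1", while a partition with two or more files yields no tagged locations.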

How was this patch tested?

Added unit tests in FileSourceStrategySuite.scala.

Please review http://spark.apache.org/contributing.html before opening a pull request.


highfei2011 commented Feb 24, 2017

You say the preferred location calculation is more complex. Which part of the code reflects that?

@tanejagagan
Author

@hvanhovell
Can you help me with this pull request?

val nonCachedLocations = files.head.locations.filterNot { x =>
  files.head.cachedLocations.contains(x)
}.map(HostTaskLocation(_).toString)
return cachedLocations ++ nonCachedLocations
Contributor


Is it possible to rewrite this to avoid using return? Thanks.
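
One way to address this, sketched under assumptions rather than taken from the patch: in Scala the last expression of a block is its value, so the concatenation can simply be the final expression. The sketch below is self-contained and does not use Spark's classes. `PreferredLocationsSketch`, `FileHosts`, and `singleFileLocations` are hypothetical names, and it assumes that a non-cached host is represented by its plain hostname.

```scala
object PreferredLocationsSketch {
  // Local stand-in for TaskLocation.inMemoryLocationTag (value quoted in the PR description).
  val inMemoryLocationTag = "hdfs_cache_"

  // Hypothetical, simplified view of the single file in the partition.
  final case class FileHosts(locations: Seq[String], cachedLocations: Seq[String])

  def singleFileLocations(file: FileHosts): Seq[String] = {
    // Hosts that hold the file in the HDFS cache get the tag.
    val cached = file.cachedLocations.map(host => inMemoryLocationTag + host)
    // Remaining hosts keep their plain hostname (assumption, see lead-in).
    val nonCached = file.locations.filterNot(host => file.cachedLocations.contains(host))
    // No `return` needed: the last expression is the result of the method.
    cached ++ nonCached
  }
}
```

If the surrounding method has an early exit for the multi-file case, an if/else or a pattern match whose branches are both expressions would serve the same purpose without `return`.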

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 16, 2020
@github-actions github-actions bot closed this Jan 19, 2020