[SPARK-19833][SQL] Remove SQLConf.HIVE_VERIFY_PARTITION_PATH, always return empty when the location does not exist #17176
Conversation
remove SQLConf.HIVE_VERIFY_PARTITION_PATH, always return empty when the location does not exist
Test build #73991 has finished for PR 17176 at commit
Force-pushed from 4bb0e28 to 8128567
    }
    // convert /demo/data/year/month/day to /demo/data/*/*/*/
    def getPathPatternByPath(parNum: Int, tempPath: Path, partitionName: String): String = {
      // if the partition path does not end with partition name, we should not
If the partition location has been altered to point to another location, we should not apply this pattern, or we will list files that do not belong to the partition. A sketch of the conversion and this guard follows.
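For illustration, a minimal sketch of this kind of path-to-glob conversion plus the guard described above, assuming Hadoop's Path API (globForPartitions and usesDefaultLayout are hypothetical names, not Spark's actual helpers):

```scala
import org.apache.hadoop.fs.Path

// Hypothetical sketch: replace the last parNum path components with "*",
// turning /demo/data/year/month/day (parNum = 3) into /demo/data/*/*/*.
def globForPartitions(parNum: Int, partPath: Path): String = {
  var base = partPath
  for (_ <- 1 to parNum) base = base.getParent
  base.toString + "/*" * parNum
}

// Hypothetical guard: only fall back to the glob when the partition still
// lives under the default layout, i.e. its path ends with the partition
// name (e.g. .../b=1/c=2); an altered location would not match this.
def usesDefaultLayout(partPath: Path, partitionName: String): Boolean =
  partPath.toString.endsWith(partitionName)
```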
Test build #73998 has finished for PR 17176 at commit
Test build #73992 has finished for PR 17176 at commit
retest this please
Test build #74016 has finished for PR 17176 at commit
why did Jenkins fail...
Test build #74072 has finished for PR 17176 at commit
@@ -159,36 +159,11 @@ class HadoopTableReader(
  def verifyPartitionPath(
      partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
      Map[HivePartition, Class[_ <: Deserializer]] = {
    if (!sparkSession.sessionState.conf.verifyPartitionPath) {
After this PR https://github.com/apache/spark/pull/17187, reading a Hive table that does not use STORED BY will no longer use HiveTableScanExec.

This function also has a bug: if a partition has a custom path,
- it will still apply the filter to every partition path in the parameter partitionToDeserializer
- it will scan paths which do not belong to the table, e.g. if the custom path is /root/a and the partitionSpec is b=1/c=2, this will lead to scanning / because of getPathPatternByPath (a short walk-through follows)
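As a short, hedged walk-through of that failure mode (the stripping loop mirrors the shape of the logic, not its exact code):

```scala
import org.apache.hadoop.fs.Path

// With the custom location /root/a and a two-column partitionSpec
// (b=1/c=2), stripping two path components leaves the filesystem root,
// so the resulting glob would cover everything under "/".
val partPath = new Path("/root/a")
val parNum = 2 // number of partition columns in b=1/c=2
var base = partPath
for (_ <- 1 to parNum) base = base.getParent
println(base) // prints "/": a pattern like /*/* would then scan from root
```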
Test build #74106 has finished for PR 17176 at commit
Test build #74107 has finished for PR 17176 at commit
@windpiger If you don't have the bandwidth to continue this, how about closing it for now?
      case (partition, partDeserializer) =>
        val partPath = partition.getDataLocation
        val fs = partPath.getFileSystem(hadoopConf)
        fs.exists(partPath)
Sending one RPC request to the NameNode per partition can result in poor performance.
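For context, one hedged way to batch those per-partition checks; hadoopConf and existingPartitions are illustrative names, not Spark's actual code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration() // stand-in for the reader's conf

// List each distinct parent directory once and test membership against
// that listing, instead of issuing one exists() RPC per partition.
def existingPartitions(partPaths: Seq[Path]): Set[Path] =
  partPaths.groupBy(_.getParent).flatMap { case (parent, children) =>
    val fs = parent.getFileSystem(hadoopConf)
    val listed =
      if (fs.exists(parent)) fs.listStatus(parent).map(_.getPath).toSet
      else Set.empty[Path]
    children.filter(listed.contains)
  }.toSet
```

This trades many exists() calls for one listStatus() per directory, which matters most when a table has thousands of partitions under a few parent directories.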
What changes were proposed in this pull request?
In SPARK-5068, we introduced a SQLConf spark.sql.hive.verifyPartitionPath. When it is set to true, it avoids task failures when a partition location does not exist in the filesystem. Since this situation should always return an empty result rather than fail the task, we remove this conf here and make that behavior the default.

In addition, the function verifyPartitionPath has a bug: if a partition has a custom path,
- it will still apply the filter to every partition path in the parameter partitionToDeserializer
- it will scan paths which do not belong to the table, e.g. if the custom path is /root/a and the partitionSpec is b=1/c=2, this will lead to scanning / because of getPathPatternByPath
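For the conf-removal part, the intended semantics are roughly the following; readPartition is a hypothetical stand-in, not the actual HadoopTableReader code path:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// A missing partition directory yields an empty result instead of a
// task failure; no conf flag is consulted any more.
def readPartition(partPath: Path, hadoopConf: Configuration): Seq[String] = {
  val fs = partPath.getFileSystem(hadoopConf)
  if (!fs.exists(partPath)) Seq.empty
  else fs.listStatus(partPath).map(_.getPath.toString).toSeq
}
```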
How was this patch tested?
Modified an existing test case.
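A plausible shape for such a test, sketched under the assumption of a local Hive-enabled session; the SQL, the table name, and the DESCRIBE-based location lookup are illustrative, not the actual test code:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .enableHiveSupport()
  .getOrCreate()
import spark.sql

// Create a partitioned Hive table, then delete one partition's directory
// behind Hive's back.
sql("CREATE TABLE t (id INT) PARTITIONED BY (p INT) STORED AS TEXTFILE")
sql("INSERT INTO t PARTITION (p = 1) SELECT 1")
val loc = new Path(
  sql("DESCRIBE FORMATTED t PARTITION (p = 1)")
    .filter("col_name = 'Location'")
    .head().getString(1))
loc.getFileSystem(spark.sparkContext.hadoopConfiguration).delete(loc, true)

// The missing location should now yield an empty result, not a failed task.
assert(sql("SELECT * FROM t WHERE p = 1").collect().isEmpty)
```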