[SPARK-27140][SQL] The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment. #23950

Closed
beliefer wants to merge 2 commits into apache:master from beliefer:test-insert-overwrite-noexist-local-path

beliefer (Contributor) commented Mar 4, 2019

What changes were proposed in this pull request?

Maropu and I discussed what happens when insert overwrite local directory targets a local path that does not exist.
In local[*] mode, maropu gave the following test case:

$ ls /tmp/noexistdir
ls: /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
scala> spark.table("t").explain
== Physical Plan ==
Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""")

$ ls /tmp/noexistdir/t/
_SUCCESS  part-00000-bbea4213-071a-49b4-aac8-8510e7263d45-c000

This test case shows that Spark creates the nonexistent path and moves the intermediate result from the local temporary path to the newly created path. The test was run on the latest master.
I followed the test case provided by maropu but observed different behavior.
I ran the SQL statements maropu provided in local[*] deploy mode on 2.3.0.
The inconsistent behavior appears as follows:

ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.table("t").explain
== Physical Plan ==
HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+                                                                       
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""")
res1: org.apache.spark.sql.DataFrame = [] 

ls /tmp/noexistdir/t/
/tmp/noexistdir/t

vi /tmp/noexistdir/t
  1 

Then I pulled the master branch, compiled it, and deployed it on my Hadoop cluster. I got the inconsistent behavior again.
The Spark version under test is 3.0.0-SNAPSHOT.

ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector with the Serial old collector is deprecated and will likely be removed in a future release
Spark context Web UI available at http://10.198.66.204:55326
Spark context available as 'sc' (master = local[*], app id = local-1551259036573).
Spark session available as 'spark'.
Welcome to spark version 3.0.0-SNAPSHOT
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("""select * from t""").show
+---+---+                                                                       
| c0| c1|
+---+---+
|  1|  1|
+---+---+


scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * from t""")
res1: org.apache.spark.sql.DataFrame = []                                       

scala> 
ll /tmp/noexistdir/t
-rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t
vi /tmp/noexistdir/t
  1

Here, too, /tmp/noexistdir/t is a file rather than a directory.
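
The two outcomes seen so far (a directory holding part files and _SUCCESS, or a single regular file) could be told apart with a small check like the one below. This is only a minimal sketch in plain Java, independent of Spark and Hadoop; the class name OutputPathCheck and the method describe are hypothetical and are not part of this PR.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical diagnostic; not part of Spark, Hadoop, or this PR.
class OutputPathCheck {

    // Reports whether the target is a directory, a regular file, or missing.
    static String describe(Path target) {
        try {
            if (Files.isDirectory(target)) {
                boolean success = Files.exists(target.resolve("_SUCCESS"));
                try (Stream<Path> entries = Files.list(target)) {
                    return "directory with " + entries.count()
                            + " entries, _SUCCESS=" + success;
                }
            }
            if (Files.isRegularFile(target)) {
                return "regular file of " + Files.size(target) + " bytes";
            }
            return "missing";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The default path is the one used in the transcripts above.
        Path target = Paths.get(args.length > 0 ? args[0] : "/tmp/noexistdir/t");
        System.out.println(target + ": " + describe(target));
    }
}
```

Running it after the insert statement in each environment would show directly which of the two results was produced.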

I want to add a UT to master and have Jenkins run it, either to confirm this behavior or to give me more information.
The UT results are the same as those of maropu's test, but different from mine.
insert overwrite local directory uses LocalFileSystem. I have checked the source of Hadoop's LocalFileSystem. LocalFileSystem does not implement the rename method itself; it extends ChecksumFileSystem, which does.
The rename method of ChecksumFileSystem is as follows:

  public boolean rename(Path src, Path dst) throws IOException {
    if (fs.isDirectory(src)) {
      return fs.rename(src, dst);
    } else {
      if (fs.isDirectory(dst)) {
        dst = new Path(dst, src.getName());
      }

      boolean value = fs.rename(src, dst);
      if (!value)
        return false;

      Path srcCheckFile = getChecksumFile(src);
      Path dstCheckFile = getChecksumFile(dst);
      if (fs.exists(srcCheckFile)) { //try to rename checksum
        value = fs.rename(srcCheckFile, dstCheckFile);
      } else if (fs.exists(dstCheckFile)) {
        // no src checksum, so remove dst checksum
        value = fs.delete(dstCheckFile, true); 
      }

      return value;
    }
  }

If the target path is a directory, ChecksumFileSystem moves the source file into the target directory.
If the target path is not a directory, ChecksumFileSystem renames the source file to the target path.
The variable named fs is a RawLocalFileSystem. RawLocalFileSystem will call the rename method of UNIXFileSystem or WinNTFileSystem.
I have tried to find out why the UT and my Spark deployment behave differently when executing insert overwrite local directory in local mode, but I have not succeeded. According to the source code of InsertIntoHiveDirCommand, nothing creates the target path if it does not exist yet. Could you help me find the reason? Thanks!
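
The two cases described above can be sketched with plain java.nio.file calls. This is only an illustration of the destination-resolution logic in the quoted rename method, assuming a local file system. RenameSketch, resolveDestination, and rename are hypothetical names. The code does not use Hadoop's FileSystem API and ignores checksum files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration class; not part of Spark or Hadoop.
class RenameSketch {

    // Mirrors the file branch of the quoted rename: an existing directory
    // receives the file under its own name; any other target becomes the new name.
    static Path resolveDestination(Path src, Path dst) {
        return Files.isDirectory(dst) ? dst.resolve(src.getFileName()) : dst;
    }

    // Moves src to the resolved destination and returns the final path.
    static Path rename(Path src, Path dst) throws IOException {
        return Files.move(src, resolveDestination(src, dst));
    }

    public static void main(String[] args) throws IOException {
        Path work = Files.createTempDirectory("rename-sketch");

        // Case 1: the target directory already exists, so the file lands inside it.
        Path part0 = Files.createFile(work.resolve("part-00000"));
        Path existing = Files.createDirectory(work.resolve("existing"));
        Path moved = rename(part0, existing);
        System.out.println(moved + " isRegularFile=" + Files.isRegularFile(moved));

        // Case 2: the target does not exist (its parent does), so the target
        // path itself becomes a regular file instead of a directory.
        Path part1 = Files.createFile(work.resolve("part-00001"));
        Path missing = work.resolve("t");
        Path renamed = rename(part1, missing);
        System.out.println(renamed + " isDirectory=" + Files.isDirectory(renamed));
    }
}
```

In the second case the result is a single file at the target path, which matches what I see in my environment. The sketch does not explain what creates the parent directory /tmp/noexistdir there.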

How was this patch tested?

UT

beliefer (Contributor Author) commented Mar 4, 2019

@maropu I need your help. Please ask Jenkins to run the tests.

SparkQA commented Mar 4, 2019

Test build #4589 has finished for PR 23950 at commit afeb4a5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

beliefer (Contributor Author) commented Mar 5, 2019

@maropu Please ask Jenkins to test again; I want to know the configuration of master. Thanks a lot!

beliefer (Contributor Author) commented Mar 6, 2019

@maropu I need your help. Please ask Jenkins to run the tests.

SparkQA commented Mar 6, 2019

Test build #4595 has finished for PR 23950 at commit de1749e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

beliefer changed the title from "[MINOR][SQL]Add a UT to test insert overwrite noexist local path." to "[SPARK-27140][SQL]The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment." on Mar 13, 2019
beliefer (Contributor Author) commented

@maropu @dongjoon-hyun Please help me find the reason.

beliefer (Contributor Author) commented Mar 14, 2019

cc @maropu @gatorsmile @dongjoon-hyun @janewangfb @cloud-fan
Please help me find the reason. Thanks a lot!

beliefer (Contributor Author) commented

@maropu Please review this PR again, thanks.

beliefer (Contributor Author) commented

@srowen Maybe you could help me review this PR. Thanks a lot!

srowen (Member) commented Mar 19, 2019

Is there anything to review here? looks like you're trying to get it to fail, and there was a failure two weeks ago. We're having Jenkins trouble right now but you can retrigger it when it comes up.

beliefer (Contributor Author) commented Mar 20, 2019

> Is there anything to review here? looks like you're trying to get it to fail, and there was a failure two weeks ago. We're having Jenkins trouble right now but you can retrigger it when it comes up.

Thanks a lot! I was hoping you would reply.
The whole process is as follows:
First, the path /tmp/noexistdir does not exist, and all of the following test cases run in local mode.
Second, when Maropu ran insert overwrite local directory '/tmp/noexistdir/t' select * from t in his environment, the nonexistent path was created as a directory, and the result files and the _SUCCESS file were moved into it.
Third, when I ran the same statement in my environment, the nonexistent path was created as a single file, and that file is one of the result files.
Fourth, I created this PR to run the same test case as a UT. When the UT ran the same statement in the Jenkins environment, the behavior matched Maropu's result.
Why does 'insert overwrite local directory' behave inconsistently in different environments?
According to the source code of InsertIntoHiveDirCommand, nothing creates the target path if it does not exist yet.
Could you help me find the reason? Thanks!

beliefer (Contributor Author) commented

@dongjoon-hyun Could you review this PR and help me find the reason? Thanks.

srowen (Member) commented Mar 26, 2019

I don't get it, is this related to #23841 ?
this test seems to fail. What are you expecting here?
I'm going to close this if you're just saying you don't know why.

beliefer (Contributor Author) commented

> I don't get it, is this related to #23841 ?
> this test seems to fail. What are you expecting here?
> I'm going to close this if you're just saying you don't know why.

I really want to know the reason for the inconsistent behavior across environments. I suspect this may be a bug, because I cannot find any place where the target path is created when it does not exist yet.
The content of this PR was originally described in #23841; I split it out of #23841 and moved the discussion to this PR.
I need your help to find the reason.

srowen (Member) commented Mar 26, 2019

OK, I don't know if anyone can or will help though. Let's stick to your original PR. A PR isn't for investigation.

srowen closed this Mar 26, 2019
beliefer (Contributor Author) commented

> OK, I don't know if anyone can or will help though. Let's stick to your original PR. A PR isn't for investigation.

OK. Maybe I could open an issue.

srowen (Member) commented Mar 26, 2019

No, you already opened two. Let's stick to your other PR/JIRA

beliefer (Contributor Author) commented

> No, you already opened two. Let's stick to your other PR/JIRA

OK.

beliefer deleted the test-insert-overwrite-noexist-local-path branch June 13, 2019 07:44