[SPARK-11522][SQL] input_file_name() returns "" for external tables #9542

Closed
wants to merge 10 commits into apache:master from xwu0226:SPARK-11522

Conversation

@xwu0226 (Contributor) commented Nov 7, 2015

When computing partitions for a non-parquet relation, HadoopRDD.compute is used, but it does not set the thread-local variable inputFileName in SqlNewHadoopRDD the way SqlNewHadoopRDD.compute does. Yet when input_file_name() is evaluated, it reads SqlNewHadoopRDD.inputFileName, which is still empty.
Setting inputFileName in HadoopRDD.compute as well resolves this issue.
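
For context, the fix follows the same pattern SqlNewHadoopRDD already uses: publish the current split's file path to the shared thread-local before records are read, and clear it for splits that are not backed by a file. A minimal sketch of that pattern inside HadoopRDD.compute (illustrative only, not the exact patch) might look like:

    import org.apache.hadoop.mapred.FileSplit

    // Once the split for this partition is known, expose its file path so that
    // input_file_name(), which reads SqlNewHadoopRDD's thread-local, can see it.
    split.inputSplit.value match {
      case fs: FileSplit => SqlNewHadoopRDD.setInputFileName(fs.getPath.toString)
      // Non-file splits have no single backing file; clear any stale value.
      case _ => SqlNewHadoopRDD.unsetInputFileName()
    }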

@rxin (Contributor) commented Nov 8, 2015

Jenkins, test this please.

@SparkQA commented Nov 8, 2015

Test build #2014 has finished for PR 9542 at commit 2658f28.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xwu0226 (Contributor, Author) commented Nov 8, 2015

@rxin I pushed again to fix the Scala style issue. Will the test build be kicked off automatically, or does it need to be triggered manually? Thanks!

@squito (Contributor) commented Nov 9, 2015

Jenkins, ok to test

@SparkQA commented Nov 9, 2015

Test build #45330 has finished for PR 9542 at commit b5fa291.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xwu0226 (Contributor, Author) commented Nov 11, 2015

@rxin or @squito, what do you think of the fix? Thanks!

split.inputSplit.value match {
  case fs: FileSplit => SqlNewHadoopRDD.setInputFileName(fs.getPath.toString)
  case _ => SqlNewHadoopRDD.unsetInputFileName()
}

@yhuai (Contributor) commented on the diff:

Can you call SqlNewHadoopRDD.unsetInputFileName() in https://github.com/apache/spark/pull/9542/files#diff-83eb37f7b0ebed3c14ccb7bff0d577c2R257? Like what we do in SqlNewHadoopRDD?
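
The diff location behind that link is not reproduced here, but the suggestion mirrors SqlNewHadoopRDD: also clear the thread-local when the record iterator for the partition is torn down, so a reused executor thread cannot report a stale file name. A rough sketch of that shape (simplified; the surrounding iterator and error handling are omitted):

    // Sketch of the reader-closing path in HadoopRDD.compute (not the exact diff):
    override def close(): Unit = {
      // Reset the per-thread file name before tearing down the reader, the same
      // way SqlNewHadoopRDD.compute does on close.
      SqlNewHadoopRDD.unsetInputFileName()
      if (reader != null) {
        reader.close()
      }
    }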

@xwu0226 (Contributor, Author) commented Nov 14, 2015

@yhuai Thanks for pointing it out! I will make the change now.

@SparkQA commented Nov 14, 2015

Test build #45911 has finished for PR 9542 at commit c27d030.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.distinct().collect().length == 1)
sql("DROP TABLE external_parquet")

// Non-External parquet pointing to /tmp/...

@yhuai (Contributor) commented on the diff:

Seems we do not need to say where it points to since it is a managed table.

@yhuai (Contributor) commented Nov 15, 2015

@xwu0226 Looks good! I left a few comments regarding the format.

@SparkQA commented Nov 15, 2015

Test build #45956 has finished for PR 9542 at commit fe2d6d8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xwu0226 (Contributor, Author) commented Nov 15, 2015

Accidentally pushed another JIRA's code together with this one. I am backing it out.

@yhuai (Contributor) commented Nov 15, 2015

LGTM pending jenkins.

@yhuai (Contributor) commented Nov 15, 2015

@xwu0226 Sorry for asking you to update several times. I just realized that you added a bunch of files under sql/hive/src/test/resources/data/. Since that directory is copied directly from Hive, we do not change or add files there. Can we just generate the test files in the test itself? We can make HiveUDFSuite extend SQLTestUtils and then use withTempPath to generate temp dirs that can be used for those external tables; see the sketch below.
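
A hedged sketch of the suggested test shape, assuming HiveUDFSuite mixes in SQLTestUtils (for withTempPath) and a Hive-backed sqlContext is in scope; the table name, schema, and data below are illustrative, not the code that was eventually merged:

    test("SPARK-11522: input_file_name() on an external table") {
      withTempPath { dir =>
        val path = dir.getCanonicalPath
        // Generate a single text file on the fly instead of checking data files
        // into sql/hive/src/test/resources/data/.
        sqlContext.range(10).selectExpr("cast(id as string) as value")
          .coalesce(1).write.text(path)
        sqlContext.sql(s"CREATE EXTERNAL TABLE external_text (id string) LOCATION '$path'")
        // Before this fix input_file_name() returned "" for such tables; with one
        // underlying file, all rows should report the same non-empty path.
        val names = sqlContext.sql("SELECT input_file_name() FROM external_text")
          .distinct().collect()
        assert(names.length == 1 && names.head.getString(0).nonEmpty)
        sqlContext.sql("DROP TABLE external_text")
      }
    }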

@SparkQA commented Nov 15, 2015

Test build #45959 has finished for PR 9542 at commit 4481c82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xwu0226 (Contributor, Author) commented Nov 15, 2015

@yhuai I did not know that we should not update the resources/data directory. I thought the test data files were added along the way by contributors. Thanks for pointing it out! Let me update HiveUDFSuite then.

@yhuai (Contributor) commented Nov 15, 2015

@xwu0226 Thank you!

@yhuai (Contributor) commented Nov 16, 2015

Oh, it seems there is a conflict...

@xwu0226 (Contributor, Author) commented Nov 16, 2015

@yhuai Is it mergeable?

@yhuai (Contributor) commented Nov 16, 2015

@xwu0226 Can you resolve the conflict? Once you update the PR and Jenkins is green, I will merge it. Thanks!

@SparkQA commented Nov 16, 2015

Test build #45970 has finished for PR 9542 at commit 83b1c77.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Nov 16, 2015

test this please

@SparkQA commented Nov 16, 2015

Test build #45979 has finished for PR 9542 at commit eeaa6b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2015

Test build #45977 has finished for PR 9542 at commit eeaa6b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xwu0226 (Contributor, Author) commented Nov 16, 2015

@yhuai The last test build passed. Do you know what might have caused the previous failures? After resolving the conflicts, my own diff for this PR is unchanged from what passed the tests before. I hope it did not break anything. Thanks!

@yhuai (Contributor) commented Nov 16, 2015

@xwu0226 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45977/consoleFull is good. I will merge it to master and branch 1.6.

asfgit closed this in 0e79604 on Nov 16, 2015
asfgit pushed a commit that referenced this pull request Nov 16, 2015
When computing partitions for a non-parquet relation, `HadoopRDD.compute` is used, but it does not set the thread-local variable `inputFileName` in `SqlNewHadoopRDD` the way `SqlNewHadoopRDD.compute` does. Yet when `input_file_name()` is evaluated, it reads `SqlNewHadoopRDD.inputFileName`, which is still empty.
Setting `inputFileName` in `HadoopRDD.compute` as well resolves this issue.

Author: xin Wu <xinwu@us.ibm.com>

Closes #9542 from xwu0226/SPARK-11522.

(cherry picked from commit 0e79604)
Signed-off-by: Yin Huai <yhuai@databricks.com>

@xwu0226 (Contributor, Author) commented Nov 16, 2015

@yhuai Many thanks!
