
[SPARK-6024][SQL] When a data source table has too many columns, its schema cannot be stored in metastore. #4795

Closed
wants to merge 6 commits

Conversation

@yhuai yhuai (Contributor) commented Feb 26, 2015

@SparkQA SparkQA commented Feb 26, 2015

Test build #28020 has started for PR 4795 at commit 12bacae.

  • This patch merges cleanly.

@SparkQA SparkQA commented Feb 26, 2015

Test build #28022 has started for PR 4795 at commit cc1d472.

  • This patch merges cleanly.

@SparkQA SparkQA commented Feb 26, 2015

Test build #28020 has finished for PR 4795 at commit 12bacae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28020/

tbl.setProperty("spark.sql.sources.schema.numOfParts", "1")
// We use spark.sql.sources.schema instead of using spark.sql.sources.schema.part.0
// because users may have already created data source tables in metastore.
tbl.setProperty("spark.sql.sources.schema", schemaJsonString)
Contributor

Why don't we just always use schema.part.0? It seems easier to consolidate the two code paths.
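For context: the approach under discussion stores the schema's JSON string across one or more table properties so that no single metastore value exceeds the length limit. A minimal sketch of that splitting idea (the helper name, standalone shape, and 4000-character threshold are illustrative, not this PR's exact code; property keys follow the final numParts naming discussed below):

// Sketch: split a possibly very long schema JSON string into fixed-size
// chunks so that no single metastore property value grows too large.
object SchemaProps {
  def toProperties(schemaJson: String, threshold: Int = 4000): Seq[(String, String)] = {
    // grouped(n) yields substrings of at most n characters each.
    val parts = schemaJson.grouped(threshold).toSeq
    val numParts = "spark.sql.sources.schema.numParts" -> parts.size.toString
    numParts +: parts.zipWithIndex.map { case (part, index) =>
      s"spark.sql.sources.schema.part.$index" -> part
    }
  }
}

Each resulting pair would then be written with tbl.setProperty, as in the snippet above.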

@SparkQA
Copy link

SparkQA commented Feb 26, 2015

Test build #28022 has finished for PR 4795 at commit cc1d472.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28022/

@SparkQA SparkQA commented Feb 26, 2015

Test build #28025 has started for PR 4795 at commit 143927a.

  • This patch merges cleanly.

@SparkQA SparkQA commented Feb 27, 2015

Test build #28025 has finished for PR 4795 at commit 143927a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28025/

@@ -69,13 +69,19 @@ private[hive] class HiveMetastoreCatalog(hive: HiveContext) extends Catalog with
     val table = synchronized {
       client.getTable(in.database, in.name)
     }
-    val schemaString = table.getProperty("spark.sql.sources.schema")
+    val schemaString = Option(table.getProperty("spark.sql.sources.schema.numOfParts")) match {
Contributor

I think it is more conventional to use numParts instead of numOfParts. Also, you can remove the pattern matching by just applying a map:

Option(table.getProperty("spark.sql.sources.schema.numParts")).map { numParts =>
  ...
}
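Spelled out, the map-based read path could look like the following sketch (the standalone shape and the require-based check are illustrative, not the merged code):

// Sketch: reassemble the schema JSON from the numbered part
// properties, returning None when no schema was recorded.
def readSchemaJson(getProperty: String => String): Option[String] =
  Option(getProperty("spark.sql.sources.schema.numParts")).map { numParts =>
    (0 until numParts.toInt).map { index =>
      val part = getProperty(s"spark.sql.sources.schema.part.$index")
      // A missing part means the stored schema is corrupted.
      require(part != null, s"missing part $index of the schema")
      part
    }.mkString
  }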

@SparkQA SparkQA commented Feb 27, 2015

Test build #28031 has started for PR 4795 at commit 73e71b4.

  • This patch merges cleanly.

val part = table.getProperty(s"spark.sql.sources.schema.part.${index}")
if (part == null) {
  throw new AnalysisException(
    "Could not read schema from the metastore because it is corrupted.")
Contributor

Sorry for being picky, but it would be great to include the reason why it is corrupted (i.e., "missing part x").
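Concretely, the check could name the missing piece, along these lines (a sketch; RuntimeException stands in for Spark's AnalysisException to keep the snippet standalone, and the final commit's wording may differ):

// Sketch: say which part is missing so a corrupted schema is
// easier to diagnose from the error message alone.
def readPart(getProperty: String => String, index: Int, numParts: Int): String = {
  val part = getProperty(s"spark.sql.sources.schema.part.$index")
  if (part == null) {
    throw new RuntimeException(
      s"Could not read schema from the metastore because it is corrupted " +
        s"(missing part $index of the schema, $numParts parts are expected).")
  }
  part
}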

@SparkQA SparkQA commented Feb 27, 2015

Test build #28031 has finished for PR 4795 at commit 73e71b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28031/

@SparkQA SparkQA commented Feb 27, 2015

Test build #28043 has started for PR 4795 at commit 4882e6f.

  • This patch merges cleanly.

@rxin rxin (Contributor) commented Feb 27, 2015

lgtm

@SparkQA SparkQA commented Feb 27, 2015

Test build #28043 has finished for PR 4795 at commit 4882e6f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28043/

@rxin rxin (Contributor) commented Feb 27, 2015

Merging in!

asfgit pushed a commit that referenced this pull request Feb 27, 2015
[SPARK-6024][SQL] When a data source table has too many columns, its schema cannot be stored in metastore.

JIRA: https://issues.apache.org/jira/browse/SPARK-6024

Author: Yin Huai <yhuai@databricks.com>

Closes #4795 from yhuai/wideSchema and squashes the following commits:

4882e6f [Yin Huai] Address comments.
73e71b4 [Yin Huai] Address comments.
143927a [Yin Huai] Simplify code.
cc1d472 [Yin Huai] Make the schema wider.
12bacae [Yin Huai] If the JSON string of a schema is too large, split it before storing it in metastore.
e9b4f70 [Yin Huai] Failed test.

(cherry picked from commit 5e5ad65)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 5e5ad65 Feb 27, 2015