@asl3 commented Dec 11, 2024

What changes were proposed in this pull request?

Support `DESCRIBE TABLE ... [AS JSON]` to optionally display table metadata in JSON format.

SQL Ref Spec:

{ DESC | DESCRIBE } [ TABLE ] [ EXTENDED | FORMATTED ] table_name { [ PARTITION clause ] | [ column_name ] } [ AS JSON ]

Output:
json_metadata: String
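
For illustration, a minimal usage sketch (assuming a local SparkSession; the table `t` and its schema are made up for the example):

```scala
import org.apache.spark.sql.SparkSession

// Assumptions: a Spark build that includes this feature; `t` is illustrative.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sql("CREATE TABLE IF NOT EXISTS t (id INT, name STRING) USING parquet")

// The command returns a single row with one string column, json_metadata.
val json: String = spark.sql("DESCRIBE TABLE EXTENDED t AS JSON").head().getString(0)
println(json)
```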

Why are the changes needed?

The Spark SQL command DESCRIBE TABLE displays table metadata in a DataFrame format geared toward human consumption. That format is hard to parse reliably, e.g. when field values contain special characters, and its layout changes as new features are added.
The new AS JSON option returns the table metadata as a JSON string that machines can parse reliably, and that can be extended with minimal risk of breaking changes. It is not meant to be human-readable.

Does this PR introduce any user-facing change?

Yes, this provides a new option to display DESCRIBE TABLE metadata in JSON format. See below (and updated golden files) for the JSON output schema:

{
  "table_name": "<table_name>",
  "catalog_name": "<catalog_name>",
  "schema_name": "<innermost_schema_name>",
  "namespace": ["<innermost_schema_name>"],
  "type": "<table_type>",
  "provider": "<provider>",
  "columns": [
    {
      "name": "<name>",
      "type": <type_json>,
      "comment": "<comment>",
      "nullable": <boolean>,
      "default": "<default_val>"
    }
  ],
  "partition_values": {
    "<col_name>": "<val>"
  },
  "location": "<path>",
  "view_text": "<view_text>",
  "view_original_text": "<view_original_text>",
  "view_schema_mode": "<view_schema_mode>",
  "view_catalog_and_namespace": "<view_catalog_and_namespace>",
  "view_query_output_columns": ["col1", "col2"],
  "owner": "<owner>",
  "comment": "<comment>",
  "table_properties": {
    "property1": "<property1>",
    "property2": "<property2>"
  },
  "storage_properties": {
    "property1": "<property1>",
    "property2": "<property2>"
  },
  "serde_library": "<serde_library>",
  "input_format": "<input_format>",
  "output_format": "<output_format>",
  "num_buckets": <num_buckets>,
  "bucket_columns": ["<col_name>"],
  "sort_columns": ["<col_name>"],
  "created_time": "<timestamp_ISO-8601>",
  "last_access": "<timestamp_ISO-8601>",
  "partition_provider": "<partition_provider>"
}
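
As a sketch of the machine-parsing use case, the returned string can be fed to any JSON library; here json4s (which Spark already bundles), with field names taken from the schema above and `json` being the json_metadata string from the earlier example:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

implicit val formats: Formats = DefaultFormats

// Parse the json_metadata string and pull out a few fields.
val meta = parse(json)
val tableName = (meta \ "table_name").extract[String]
val columnNames = (meta \ "columns").children.map(c => (c \ "name").extract[String])
```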

How was this patch tested?

- Updated golden files for `describe.sql`
- Added tests in `DescribeTableParserSuite.scala`, `DescribeTableSuite.scala`, and `PlanResolutionSuite.scala`

Was this patch authored or co-authored using generative AI tooling?

@cloud-fan commented:

thanks, merging to master!

@cloud-fan closed this in 36d23ef on Jan 7, 2025
@dongjoon-hyun commented:

I made a follow-up.

@dongjoon-hyun changed the title from [SPARK-50541] Describe Table As JSON to [SPARK-50541][SQL] Describe Table As JSON on Jan 7, 2025
dongjoon-hyun added a commit that referenced this pull request on Jan 7, 2025

Use `SPARK_VERSION` instead of hard-coded version strings

### What changes were proposed in this pull request?

This is a follow-up to use `SPARK_VERSION` instead of hard-coded version strings.
- #49139

### Why are the changes needed?

Hard-coded version strings will start causing unit test failures next week, when the Apache Spark 4.0.0 RC process begins, and again in maintenance releases like 4.0.1-SNAPSHOT.

**BEFORE**
```
$ git grep 'created_by = Some("Spark '
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala:        created_by = Some("Spark 4.0.0-SNAPSHOT"),
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala:        created_by = Some("Spark 4.0.0-SNAPSHOT"),
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala:        created_by = Some("Spark 4.0.0-SNAPSHOT"),
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala:          created_by = Some("Spark 4.0.0-SNAPSHOT"),
sql/core/src/test/scala/org/apache/spark/sql/execution/command/v1/DescribeTableSuite.scala:        created_by = Some("Spark 4.0.0-SNAPSHOT"),
```

**AFTER**
```
$ git grep 'created_by = Some("Spark '
$
```
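
For context, a minimal sketch of the pattern this follow-up applies. `SPARK_VERSION` is the real constant from the `org.apache.spark` package; the surrounding test value is paraphrased from the diff above:

```scala
import org.apache.spark.SPARK_VERSION

// BEFORE (brittle): breaks whenever the build version changes.
// created_by = Some("Spark 4.0.0-SNAPSHOT")

// AFTER (version-agnostic): derive the expected value from the build itself.
val created_by = Some(s"Spark $SPARK_VERSION")
```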

### Does this PR introduce _any_ user-facing change?

No, this is a test-case fix.

### How was this patch tested?

Pass the CIs and check manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49401 from dongjoon-hyun/SPARK-50541.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request on Jan 15, 2025
### What changes were proposed in this pull request?

This is a follow-up of #49139 that uses a v2 command to simplify the code. Now only one logical plan is needed, and all of the implementation is centralized in that plan; there is no need to touch other analyzer/planner rules. A sketch of the idea follows.
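
A simplified, hypothetical illustration of that design choice (the names below are illustrative, not Spark's internal API): the command is modeled as one self-contained node that knows how to produce its own output, so no separate rules need to know about it.

```scala
// Hypothetical sketch, not the actual Spark internals: a single
// self-contained command node instead of logic spread across rules.
trait RunnableNode { def run(): Seq[String] }

final case class DescribeTableAsJson(table: String) extends RunnableNode {
  // The real implementation would consult the catalog; here we only show
  // that resolution, execution, and output formatting live in one place.
  override def run(): Seq[String] = Seq(s"""{"table_name": "$table"}""")
}
```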

### Why are the changes needed?

Code simplification.

### Does this PR introduce _any_ user-facing change?

No, this feature has not been released yet.

### How was this patch tested?

Updated tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49466 from cloud-fan/as-json.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>