[SPARK-20067] [SQL] Unify and Clean Up Desc Commands Using Catalog Interface #17394
Conversation
Test build #75082 has finished for PR 17394 at commit
Test build #75087 has finished for PR 17394 at commit
cc @cloud-fan
@@ -97,16 +97,21 @@ DESC EXTENDED t
struct<col_name:string,data_type:string,comment:string>
-- !query 6 output
# Detailed Table Information CatalogTable(
Do we need to wrap the information with `CatalogTable(...)`?
Yes. Let me remove it and move it to the next line
The output looks weird if we sort the Command, so I filed a PR: #17414
|-- a: string (nullable = true)
|-- b: integer (nullable = true)
|-- c: string (nullable = true)
|-- d: string (nullable = true)
@cloud-fan Do we still need `Schema`, if the above info already has it? The only missing part is the nullability, which could be put in the comment or added as a separate column.
Yeah, we do not need it, but I think `CatalogTable.toString` should still have the schema. Actually, why do we have `DESC EXTENDED`? I think `DESC FORMATTED` is good enough.
Below is what *Programming Hive* says:
Replacing EXTENDED with FORMATTED provides more readable but also more verbose output.
Most end users should use `DESC FORMATTED`.
Maybe we can keep the output of `DESC EXTENDED` as it is. Then we need to revisit `DESC FORMATTED` to ensure it includes all the user-facing metadata info.
Test build #75185 has finished for PR 17394 at commit
Test build #75200 has finished for PR 17394 at commit
retest this please
Test build #75221 has finished for PR 17394 at commit
@@ -273,28 +278,32 @@ case class CatalogTable(
  if (bucketColumnNames.nonEmpty) s"Bucket Columns: $bucketColumnsString" else "",
  if (sortColumnNames.nonEmpty) s"Sort Columns: $sortColumnsString" else ""
)
case _ => Nil
}
We should de-duplicate the code.
Test build #75232 has finished for PR 17394 at commit
Test build #75243 has finished for PR 17394 at commit
@@ -21,6 +21,7 @@ import java.net.URI
import java.util.Date

import com.google.common.base.Objects
import scala.collection.mutable
Usually we put the imports of the Scala library above third-party libraries.
If we change the order, can scalastyle pass?
Usually, the import order is:
import java.util.Date
import scala.collection.mutable
import com.google.common.base.Objects
`Dataset.scala` is an example.
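For illustration, a fuller sketch of that convention. The `org.apache.spark` group and the blank lines between groups are assumptions based on the usual Spark/scalastyle import grouping, not something quoted from this thread:

```scala
// Spark's import groups, in order, separated by blank lines:
// 1. java.*   2. scala.*   3. third-party   4. org.apache.spark.*
import java.util.Date

import scala.collection.mutable

import com.google.common.base.Objects

import org.apache.spark.sql.Dataset
```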
Please update the PR description too since you updated the title.
After this PR, what's the difference between
After refactoring, I think
Below is the output of
Below is the output of
Test build #75295 has started for PR 17394 at commit
Test build #75294 has finished for PR 17394 at commit
seems
Yes. We do not need to keep the difference between
Test build #75440 has finished for PR 17394 at commit
@@ -214,6 +215,7 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext {
  // Returns true if the plan is supposed to be sorted.
  def isSorted(plan: LogicalPlan): Boolean = plan match {
Maybe call it `needSort`?
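The check under discussion can be illustrated with a toy sketch. The plan types below are stand-ins, not Spark's actual `LogicalPlan` classes, and the cases shown are illustrative only:

```scala
// Toy sketch of the idea behind the isSorted/needSort check in
// SQLQueryTestSuite: compare rows as-is only when the plan's output
// order is deterministic; otherwise sort before comparing.
object NeedSortSketch {
  sealed trait LogicalPlan
  case class Sort(child: LogicalPlan) extends LogicalPlan     // explicit ORDER BY
  case object Command extends LogicalPlan                     // DDL command output
  case class Project(child: LogicalPlan) extends LogicalPlan
  case object Scan extends LogicalPlan

  def needSort(plan: LogicalPlan): Boolean = plan match {
    case _: Sort        => true             // ORDER BY fixes the row order
    case Command        => true             // e.g. DESC has a fixed row order
    case Project(child) => needSort(child)  // projections preserve ordering
    case _              => false
  }
}
```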
# Detailed Table Information
Table `default`.`t`
Created[removed by SQLQueryTestSuite]
This looks weird; how about `Created [not included in comparison]`?
Sure
Location[removed by SQLQueryTestSuite]sql/core/spark-warehouse/t
Partition Provider Catalog
Partition Columns [`c`, `d`]
Schema root
this information is duplicated
Let me remove both
d string
# Detailed Table Information
Table `default`.`t`
Shall we separate it into two lines, `Database` and `Table`?
Sure
Detailed Partition Information CatalogPartition(
Partition Values: [c=Us, d=1]
Storage(Location: sql/core/spark-warehouse/t/c=Us/d=1)
Partition Parameters:{})
a string
We should add a header here: `# col_name data_type comment`.
Done.
    buffer: ArrayBuffer[Row]): Unit = {
  append(buffer, "", "", "")
  append(buffer, "Detailed Partition Information " + partition.toString, "", "")
  if (isExtended) describeFormattedDetailedPartitionInfo(table, metadata, partition, result)
Not related to this PR, but it looks weird that `DESC tbl` and `DESC tbl PARTITION (xxx)` have the same result.
This function `describeDetailedPartitionInfo` will only be called for the DDL command `DESCRIBE [EXTENDED|FORMATTED] table_name PARTITION (partitionVal*)`.
# Detailed Partition Information
Database default
Table t
Not related to this PR, but why do we put the `Database` and `Table` in `# Detailed Partition Information`? Did we follow Hive?
Yes.
hive> DESC EXTENDED page_view PARTITION(dt='part1', country='part2');
OK
viewtime int
userid bigint
page_url string
referrer_url string
ip string IP Address of the User
dt string
country string
# Partition Information
# col_name data_type comment
dt string
country string
Detailed Partition Information Partition(values:[part1, part2], dbName:default, tableName:page_view, createTime:1491066325, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:viewtime, type:int, comment:null), FieldSchema(name:userid, type:bigint, comment:null), FieldSchema(name:page_url, type:string, comment:null), FieldSchema(name:referrer_url, type:string, comment:null), FieldSchema(name:ip, type:string, comment:IP Address of the User), FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:country, type:string, comment:null)], location:file:/user/hive/warehouse/page_view/dt=part1/country=part2, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{totalSize=0, numRows=0, rawDataSize=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1491066325})
Test build #75446 has finished for PR 17394 at commit
  }
}
case class SimpleDDLScan(
These two classes are moved to DataSourceTest.scala
Test build #75455 has finished for PR 17394 at commit
-- !query 7
-- !query 10
DESC t PARTITION (c='Us', d=1)
This has the same result as `DESC t`; is it the same in Hive?
Yes.
hive> DESC page_view PARTITION(dt='part1', country='part2');
OK
viewtime int
userid bigint
page_url string
referrer_url string
ip string IP Address of the User
dt string
country string
# Partition Information
# col_name data_type comment
dt string
country string
Time taken: 0.12 seconds, Fetched: 13 row(s)
hive> DESC page_view;
OK
viewtime int
userid bigint
page_url string
referrer_url string
ip string IP Address of the User
dt string
country string
# Partition Information
# col_name data_type comment
dt string
country string
Time taken: 0.022 seconds, Fetched: 13 row(s)
Partition Values [c=Us, d=1]
Location [not included in comparison]sql/core/spark-warehouse/t/c=Us/d=1
# Table Storage Information
Why do we have a new `# Table Storage Information` section for `DESC EXTENDED t PARTITION (xxx)`?
Hive does it. See the following output.
hive> DESC FORMATTED page_view PARTITION(dt='part1', country='part2');
OK
# col_name data_type comment
viewtime int
userid bigint
page_url string
referrer_url string
ip string IP Address of the User
# Partition Information
# col_name data_type comment
dt string
country string
# Detailed Partition Information
Partition Value: [part1, part2]
Database: default
Table: page_view
CreateTime: Sat Apr 01 17:05:25 UTC 2017
LastAccessTime: UNKNOWN
Location: file:/user/hive/warehouse/page_view/dt=part1/country=part2
Partition Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1491066325
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.124 seconds, Fetched: 39 row(s)
hive> DESC EXTENDED page_view PARTITION(dt='part1', country='part2');
OK
viewtime int
userid bigint
page_url string
referrer_url string
ip string IP Address of the User
dt string
country string
# Partition Information
# col_name data_type comment
dt string
country string
Detailed Partition Information Partition(values:[part1, part2], dbName:default, tableName:page_view, createTime:1491066325, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:viewtime, type:int, comment:null), FieldSchema(name:userid, type:bigint, comment:null), FieldSchema(name:page_url, type:string, comment:null), FieldSchema(name:referrer_url, type:string, comment:null), FieldSchema(name:ip, type:string, comment:IP Address of the User), FieldSchema(name:dt, type:string, comment:null), FieldSchema(name:country, type:string, comment:null)], location:file:/user/hive/warehouse/page_view/dt=part1/country=part2, inputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{totalSize=0, numRows=0, rawDataSize=0, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numFiles=0, transient_lastDdlTime=1491066325})
Time taken: 0.116 seconds, Fetched: 15 row(s)
Hive uses `# Storage Information`, which seems to be the storage of the partition, not the table.
Yes. I just checked the source code of Hive. This section in Hive is the storage for the partition, but the info originally comes from the table storage, as shown in the code. We did the same thing.
The only difference is that our `BucketSpec` is not part of `CatalogStorageFormat`. Should we follow Hive and move it into `CatalogStorageFormat`? Or do it as a follow-up? After the move, we can easily output exactly the same thing as Hive.
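A toy sketch of the two layouts being discussed. The field lists are heavily trimmed and these are not Spark's or Hive's actual definitions:

```scala
// Contrast of where bucketing metadata lives in the two designs.
object BucketLayoutSketch {
  case class BucketSpec(
      numBuckets: Int,
      bucketColumnNames: Seq[String],
      sortColumnNames: Seq[String])

  // Current Spark-style layout: bucketing hangs off the table,
  // so all partitions implicitly share the table's bucketing.
  case class CatalogStorageFormat(locationUri: Option[String])
  case class CatalogTable(
      storage: CatalogStorageFormat,
      bucketSpec: Option[BucketSpec])

  // Hive-style alternative: bucketing lives inside the storage
  // descriptor, so each partition could carry its own bucketing.
  case class HiveStyleStorage(
      locationUri: Option[String],
      bucketSpec: Option[BucketSpec])
}
```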
Shall we rename this section to `Storage Information`? We may have per-partition storage in the future, although for now this is always the table storage.
Sure, let me make the change.
LGTM
Test build #75497 has finished for PR 17394 at commit
Thanks! Merging to master.
Hi, all.
Add the following line when issuing
Yep. That one.
It's reported at https://issues.apache.org/jira/browse/SPARK-20954.
What I mean is to remove
Sure, we can remove it. Feel free to submit a PR.
Thank you for the advice. I'll make a PR soon.
Hi, all.
What changes were proposed in this pull request?
This PR is to unify and clean up the outputs of `DESC EXTENDED/FORMATTED` and `SHOW TABLE EXTENDED` by moving the logic into the Catalog interface. The output formats are improved, and missing attributes are added. It impacts DDL commands like `SHOW TABLE EXTENDED`, `DESC EXTENDED`, and `DESC FORMATTED`.
In addition, by following what we did in the Dataset API `printSchema`, we can use `treeString` to show the schema in a more readable way.
Below is the current way:
After the change, it should look like:
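The actual before/after screenshots are not captured on this page. As an illustrative sketch only, the `printSchema`/`treeString` tree rendering referred to above can be produced like this (assumes a local `SparkSession`; this snippet is not code from the PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(1).selectExpr("CAST(id AS STRING) AS a", "CAST(id AS INT) AS b")

// printSchema renders the schema via treeString, e.g.:
// root
//  |-- a: string (nullable = ...)
//  |-- b: integer (nullable = ...)
df.printSchema()
```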
How was this patch tested?
describe.sql and show-tables.sql