get_table_names() returns incorrect list of table names when querying Spark SQL Thriftserver #150
Spark Thriftserver claims to be compatible with Hive, so I think the correct fix is to edit the Spark code. But if they don't want to match Hive, the next best option is probably to figure out a hard-coded list of column names compatible with all Spark and Hive versions. In my opinion, this is inferior, since it pushes the problem down to every user of Spark.
…On Aug 27, 2017 11:13 PM, "Reza Baktash" wrote:
Hi,
I found out that the method `get_table_names()` in class `HiveDialect` in `sqlalchemy_hive.py` does not return the correct table names. The rows returned by the query `show tables` are 3-tuples of the form `(schema_name, table_name, isTemporary)`. You need index 1 for the table names, but you take index 0, so it returns a list of duplicates of the `schema_name`.
What can we do to fix this?
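The mismatch described above can be reproduced with a minimal sketch. The row shapes below are illustrative, not taken from the PyHive source: Hive's `show tables` commonly yields a single `tab_name` column, while Spark SQL yields three.

```python
# Illustrative row shapes returned by `show tables` in each engine.
hive_rows = [("users",), ("orders",)]                # Hive: single tab_name column
spark_rows = [("default", "users", False),           # Spark SQL: 3-tuples of
              ("default", "orders", False)]          # (database, tableName, isTemporary)

def get_table_names_current(rows):
    # PyHive's behavior at the time of this issue: always take the first column.
    return [row[0] for row in rows]

print(get_table_names_current(hive_rows))   # ['users', 'orders'] -- correct
print(get_table_names_current(spark_rows))  # ['default', 'default'] -- schema name duplicated
```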
Referencing the line after checking it out: PyHive/pyhive/sqlalchemy_hive.py, line 303 at commit 44717f7.
Does Spark SQL return the same format regardless of …?
@mistercrunch No! They both have the same format.
@jingw Is it possible to add a third option [sparksql] to pyhive alongside [hive] and [presto]? I know it would be heavily redundant code, but it seems necessary for Spark to include the database/schema name in …
@jingw how about … Would you approve such a PR?
Nice, that sounds easier than what I was thinking :)
I guess that'll break if someone decides to add another column (maybe …). Re the Spark dialect: I don't currently use Spark SQL, so I'd rather not add extra modules or branching logic that I can't test.
@mistercrunch
I guess that leaves us with having a list of possible column names. #68 changed from `row.tab_name` to `row[0]` due to some incompatibility. We could crawl through the Hive/Spark source code for every possible name of this field. Alternatively we could branch on `len(row)`, but that seems even more fragile. Ultimately everything's a workaround for the root inconsistency.
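A hedged sketch of that column-name-list workaround, assuming cursor rows behave like named tuples. The candidate names here are guesses that would need verifying against each Hive/Spark release, which is exactly the fragility being discussed.

```python
from collections import namedtuple

# Stand-ins for cursor rows; real DB-API rows expose column names similarly.
HiveRow = namedtuple("HiveRow", ["tab_name"])
SparkRow = namedtuple("SparkRow", ["database", "tableName", "isTemporary"])

def get_table_names(rows):
    # Try each known table-name column; fall back to the first column.
    candidates = ("tab_name", "tableName")  # hypothetical, not an exhaustive list
    names = []
    for row in rows:
        for col in candidates:
            if hasattr(row, col):
                names.append(getattr(row, col))
                break
        else:
            names.append(row[0])
    return names

print(get_table_names([HiveRow("users")]))                     # ['users']
print(get_table_names([SparkRow("default", "users", False)]))  # ['users']
```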
That feels fragile / magical to me :/
Arguably we should be using the Metastore thrift client for the metadata fetching, but then we'd need to overload the connection string with extra parameters to connect to the Metastore. Not ideal either. What about creating a new PyPI package for the SQLAlchemy dialect?
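If the separate-package route were taken, SQLAlchemy supports registering third-party dialects at runtime without touching PyHive itself. A sketch, where the module path and class name are hypothetical, not a real package:

```python
from sqlalchemy.dialects import registry

# Register a "sparksql" dialect name that points at a hypothetical
# external package providing the dialect class.
registry.register("sparksql", "sparksql_dialect.dialect", "SparkSQLDialect")

# Engines could then be created with a URL like:
# create_engine("sparksql://host:10000/default")
```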
So, what do we do now? :D
Can we merge the PR and close this bug?
Spark SQL's `show tables` query returns 3 columns instead of the 2 in Hive. As such, a temporary workaround was made so that the `get_table_names` function does not break. PyHive developers are currently still on the fence about fixing this issue; this commit can be reverted once they decide to fix it. Reference: dropbox/PyHive#150
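The workaround in that commit amounts to branching on row length; a minimal sketch of the idea (the function name is mine, not the commit's):

```python
def extract_table_name(row):
    # Spark SQL's `show tables` yields (database, tableName, isTemporary),
    # so take index 1 only for 3-column rows; otherwise keep index 0.
    return row[1] if len(row) == 3 else row[0]

print(extract_table_name(("default", "users", False)))  # 'users'
print(extract_table_name(("users",)))                   # 'users'
```

As noted earlier in the thread, this breaks if an engine ever adds a fourth column, so it is a stopgap rather than a fix.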
bump |
Based on the above commits, changed a small part of ./superset/models/core.py.