
get_table_names() returns incorrect list of table names when querying Spark SQL Thriftserver #150


Open
rezabaktash opened this issue Aug 28, 2017 · 15 comments


@rezabaktash

Hi,
I found that the method get_table_names() in the HiveDialect class in sqlalchemy_hive.py does not return the correct table names. Against Spark SQL, the rows returned by `show tables` are 3-tuples of the form (schema_name, table_name, isTemporary). The table name is at index 1, but the code reads index 0, so it returns a list of copies of the schema name.
What can we do to fix this?
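For anyone who wants to reproduce this, here is a minimal sketch (the host, port, and schema below are placeholders for your own Thriftserver):

```python
# Minimal reproduction sketch; host/port/schema are placeholders.
from pyhive import hive

conn = hive.connect(host='localhost', port=10000)
cursor = conn.cursor()
cursor.execute('SHOW TABLES IN default')
rows = cursor.fetchall()

# Against Spark SQL each row looks like ('default', 'my_table', False),
# i.e. (schema_name, table_name, isTemporary), so the table name is row[1].
print(rows)
print([row[1] for row in rows])
```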

@jingw
Contributor

jingw commented Aug 28, 2017 via email

@mistercrunch

Referencing the line in question after checking out the code:

```python
return [row[0] for row in connection.execute(query)]
```

Does Spark SQL return the same format for both `SHOW TABLES IN {schema}` and `USE {schema}; SHOW TABLES;`?
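For context, that line sits in HiveDialect.get_table_names in sqlalchemy_hive.py, which at the time looks roughly like this (reconstructed from the line quoted above and the first comment):

```python
def get_table_names(self, connection, schema=None, **kw):
    query = 'SHOW TABLES'
    if schema:
        query += ' IN ' + self.identifier_preparer.quote_identifier(schema)
    # On Hive the first column is the table name; on Spark SQL it is the
    # schema name, which is the bug discussed here.
    return [row[0] for row in connection.execute(query)]
```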

@rezabaktash
Author

@mistercrunch They both return the same format.

@rezabaktash
Author

rezabaktash commented Aug 29, 2017

@jingw Is it possible to add a third option, [sparksql], to PyHive alongside [hive] and [presto]? I know it would mean heavily redundant code, but it seems necessary for Spark to include the database/schema name in the SHOW TABLES result: Spark's temp tables don't belong to a schema, so they are returned for every SHOW TABLES query and need to be distinguished from the others.
Meanwhile, we can ask the Spark project to decide whether the SHOW TABLES command should stay Hive-compatible or not.

@mistercrunch

@jingw how about

```python
return [row[-1] for row in connection.execute(query)]
```

Would you approve such a PR?

@jingw
Contributor

jingw commented Aug 29, 2017

Nice, that sounds easier than what I was thinking :)

@jingw
Contributor

jingw commented Aug 29, 2017

I guess that'll break if someone decides to add another column (maybe `show tables` will include the table comment some day). Oh well. Both approaches I can think of (column names, column index) have been problematic.

Re the Spark dialect: I don't currently use Spark SQL, so I'd rather not add extra modules or branching logic that I can't test.

@rezabaktash
Author

@mistercrunch `show tables` returns three fields, as I mentioned in the first comment: (schema_name, table_name, isTemporary). row[-1] would return the isTemporary flag, not the table name.

@jingw
Contributor

jingw commented Aug 29, 2017

I guess that leaves us with having a list of possible column names. #68 changed row.tab_name to row[0] due to some incompatibility, so we could crawl through the Hive/Spark source code for every possible name of this field. Alternatively, we could branch on len(row), but that seems even more fragile. Ultimately, everything is a workaround for the root inconsistency.
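A sketch of that column-name approach, assuming `tab_name` (Hive) and `tableName` (Spark SQL) are the two candidate names; any others would have to come from crawling the sources as described above:

```python
def get_table_names(self, connection, schema=None, **kw):
    query = 'SHOW TABLES'
    if schema:
        query += ' IN ' + self.identifier_preparer.quote_identifier(schema)
    names = []
    for row in connection.execute(query):
        keys = row.keys()
        if 'tableName' in keys:        # Spark SQL's column name
            names.append(row['tableName'])
        elif 'tab_name' in keys:       # Hive's column name
            names.append(row['tab_name'])
        else:
            names.append(row[0])       # unknown layout: keep today's behavior
    return names
```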

@jingw
Contributor

jingw commented Aug 29, 2017

That feels fragile / magical to me :/

@mistercrunch

mistercrunch commented Aug 30, 2017

Arguably we should be using the Metastore thrift client for the metadata fetching, but then we'd need to overload the connection string with extra parameters to connect to the Metastore. Not ideal either.

What about creating a new PyPI package for a sparksql:// SQLAlchemy dialect that essentially derives from the hive:// dialect here and overrides this one method? That would also let it diverge further in the future if needed.
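A hypothetical sketch of that package's core; the class name and the registration call are illustrative, not a published package:

```python
# Illustrative only: a 'sparksql' dialect deriving from PyHive's HiveDialect.
from pyhive.sqlalchemy_hive import HiveDialect
from sqlalchemy.dialects import registry


class SparkSQLDialect(HiveDialect):
    name = 'sparksql'

    def get_table_names(self, connection, schema=None, **kw):
        query = 'SHOW TABLES'
        if schema:
            query += ' IN ' + self.identifier_preparer.quote_identifier(schema)
        # Spark SQL rows are (schema_name, table_name, isTemporary).
        return [row[1] for row in connection.execute(query)]


# Lets create_engine('sparksql://...') resolve to this class.
registry.register('sparksql', __name__, 'SparkSQLDialect')
```

A real package would register the dialect through a setuptools entry point instead of calling registry.register at import time.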

@rezabaktash
Author

So, what do we do now? :D

@maver1ck

maver1ck commented May 29, 2018

Can we merge the PR and close this bug?

ShirmineWang pushed a commit to ShirmineWang/incubator-superset that referenced this issue Jul 2, 2018
Spark SQL's `show tables` query returns 3 columns instead of the 2
returned by Hive. As such, a temporary workaround is made so that the
get_table_names function does not break.

PyHive developers are currently still on the fence about fixing this issue.
This commit can be reverted once they decide to fix it.

Reference: dropbox/PyHive#150
@tooptoop4

bump

@kwontaeheon

kwontaeheon commented Mar 23, 2021

Based on the commits above, I changed a small part of ./superset/models/core.py.
It's a workaround, but it works.

```python
def get_all_table_names_in_schema(
    self,
    schema: str,
    cache: bool = False,
    cache_timeout: Optional[int] = None,
    force: bool = False,
) -> List[utils.DatasourceName]:
    """Parameters need to be passed as keyword arguments.

    For unused parameters, they are referenced in
    cache_util.memoized_func decorator.

    :param schema: schema name
    :param cache: whether cache is enabled for the function
    :param cache_timeout: timeout in seconds for the cache
    :param force: whether to force refresh the cache
    :return: list of tables
    """
    try:
        # Workaround for table names in Spark Thriftserver: read the
        # schema and table name positionally from SHOW TABLES.
        if self.db_engine_spec == db_engine_specs.hive.HiveEngineSpec:
            engine = self.get_sqla_engine()
            if schema:
                rs = engine.execute('show tables in %s' % schema).fetchall()
            else:
                rs = engine.execute('show tables').fetchall()
            return [utils.DatasourceName(table=x[1], schema=x[0]) for x in rs]

        tables = self.db_engine_spec.get_table_names(
            database=self, inspector=self.inspector, schema=schema
        )
        return [
            utils.DatasourceName(table=table, schema=schema) for table in tables
        ]
    except Exception as ex:  # pylint: disable=broad-except
        logger.warning(ex)
```
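One caveat with this workaround: the branch applies to every database whose engine spec is HiveEngineSpec, so it effectively assumes all Hive-typed connections point at a Spark Thriftserver. Against a plain HiveServer2, whose `show tables` returns a single column, x[1] would raise an IndexError.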
