
schema doesn't change from default #2

Closed
jonathan-dufault-kr opened this issue Jan 26, 2022 · 14 comments · Fixed by #4

Comments

@jonathan-dufault-kr

jonathan-dufault-kr commented Jan 26, 2022

Sorry if this is a really dumb/misguided question; if it is, I'm hoping you could point me to a place where I can learn more.

I set up an engine like

engine = create_engine( f"databricks+connector://token:{token}@{host}:443/{my_schema}", connect_args={"http_path": http_path} )

but engine.table_names() and anything else I try to do with that engine returns tables from the default schema.

I have to work around it by doing a schema translation, but that can't be the right way to do this, or is it?

engine1 = engine.execution_options(schema_translate_map={ "default": my_schema })

edit: whoops put some private data in there

@jonathan-dufault-kr
Author

Nope, the workaround doesn't work. This is just broken and I have no idea how to fix it.

@crflynn
Owner

crflynn commented Jan 26, 2022

Databricks connection URLs are a bit different from other database URLs, but what we do here is for adherence to the SQLAlchemy spec. Maybe the implementation is non-intuitive or confusing (or even wrong), but the value for the <database> is ignored by the connector, which only cares about the http_path and the token.

Generally, if you want a specific schema (Databricks calls them databases) you have to pass it to methods explicitly, like engine.table_names(schema="myschema").
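The same schema argument also works with SQLAlchemy's inspector (a minimal sketch; "myschema" is a placeholder name):

from sqlalchemy import inspect

# list tables from an explicit, non-default schema
inspector = inspect(engine)
print(inspector.get_table_names(schema="myschema"))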

@crflynn
Owner

crflynn commented Jan 26, 2022

We might actually want to add a USE SCHEMA <schema> call here on creating the connection. It might be that the sql connector doesn't do this, but the connector is not open source so I can't request a change there. I will take a deeper look.
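In the meantime, something like this might approximate it from the user side (an untested sketch using a standard SQLAlchemy connect event; whether the Databricks cursor accepts a bare USE statement here is an assumption, and "myschema" is a placeholder):

from sqlalchemy import event

@event.listens_for(engine, "connect")
def set_default_schema(dbapi_connection, connection_record):
    # runs on every new DBAPI connection the pool opens
    cursor = dbapi_connection.cursor()
    cursor.execute("USE myschema")
    cursor.close()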

@crflynn
Owner

crflynn commented Jan 26, 2022

Can you try doing something like

engine.execute("use myschema")

as a workaround?

@jonathan-dufault-kr
Author

jonathan-dufault-kr commented Jan 27, 2022

I'll try that. In the meantime, this is what I've done that's been working: in the MetaData object I can specify a schema and then use that to tell SQLAlchemy to use the engine but override the schema.

I reopened this if that's alright. It keeps it on my radar.

That's how I'm understanding it. I'm only now learning this aspect of SQLAlchemy. Before, I just created an engine and did pd.read_sql(text(sql_query_text), engine) over an ODBC connection. With the switch to Databricks I'm trying to learn the SQLAlchemy/Pythonic way of doing things, so you're getting a double-whammy.

from sqlalchemy import MetaData
from sqlalchemy import Table, Column, String
from sqlalchemy.orm import Session
from sqlalchemy.ext.automap import automap_base


# generate metadata and be able to tell which schema to use
meta = MetaData(bind=engine, schema="analytics_layer")

# the fake primary key below is just because none of the tables in this schema have primary keys (I live in hell);
# rough and ready to get it to work. I think I can make a composite key
table_of_interest = Table(
    "table_in_database",
    meta,
    Column("fake_primary_key_column", String, primary_key=True),
    extend_existing=True,
    autoload=True,
    autoload_with=engine,
)

# I can create a base with the metadata object
Base = automap_base(metadata=meta)


# reflecting the tables from that metadata object, specifying the reflection options
# because this layer has hundreds of tables
Base.prepare(engine, reflect=True, reflection_options={"only": ["table_in_database"]})
Table_Of_Interest = Base.classes.table_in_database

# Since Base is made with the schema, I believe if I were to create tables on it then they would go in the right schema.
# I am doing read-only so I haven't really tested that. I believe creating tables in this layer
# would get me promoted to customer.

This is an even quicker/dirtier method that I'm actually using, since the process I'm doing right now doesn't require anything too complicated by way of the ORM.

# generate metadata and be able to tell which schema to use
meta = MetaData(bind=engine, schema="analytics_layer")

Table_Of_Interest = Table("table_in_database", meta, autoload=True, autoload_with=engine)
session = Session(bind=engine)

query = session.query(Table_Of_Interest).distinct().limit(10)
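And since I usually just want a DataFrame anyway, I can hand that query straight to pandas (assuming pandas is available; query.statement gives the underlying select):

import pandas as pd

# pull the limited result set into a DataFrame
df = pd.read_sql(query.statement, engine)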

edit: fix code blocks
edits 2 and 3: still fixing code blocks

@crflynn
Owner

crflynn commented Jan 27, 2022

When this dialect used pyhive underneath, this would have worked fine because pyhive issues a USE <schema> call on connection here: https://github.com/dropbox/PyHive/blob/d199a1bd55c656b5c28d0d62f2d3f2e6c9a82a54/pyhive/hive.py#L256

Apparently the Databricks SQL connector doesn't do this, but since the code isn't public I can't request a change. I did ask if they have plans to make the code public.

@susodapop
Contributor

susodapop commented Mar 16, 2022

The code for databricks-sql-connector isn't hosted on a public Git repository, but it's still Apache licensed (open source) and accessible because it's hosted on PyPI. If you pip install databricks-sql-connector you can see all of its source code in your site-packages directory.

[edit] Also, the code includes a HISTORY file. You can download the .tar.gz file from here and unzip it.

[edit2] Just for giggles I created https://github.com/susodapop/databricks-sql-connector-mirror which mirrors the codebase so you can examine it on GitHub.

[edit3] I set up the versions in GitHub so you can also observe the changes from one version to the next.

@crflynn
Owner

crflynn commented Mar 16, 2022

Thanks for mirroring it. I didn't have any trouble seeing the code, I just wasn't able to PR an upstream change since it's not maintained publicly. I reached out to DB though and they said it will be made public "eventually" so we'll see.


@crflynn
Owner

crflynn commented Mar 16, 2022

Also from history:

v2.0.0b - February 8, 2022
- Initial beta release of V2. V2 is an internal re-write of large parts of the connector to use Databricks edge features. All public APIs from V1 remain.
- Added Unity Catalog support (pass catalog and / or  schema key word args to the .connect method to select initial schema and catalog)

So I assume that when 2.0 is released we can pass the catalog and schema and it will set those on connection creation.
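If the dialect just forwards connect_args to the connector's .connect(), that would presumably look something like this (untested sketch against the 2.0 beta; the catalog/schema values are placeholders and the pass-through is an assumption):

from sqlalchemy import create_engine

engine = create_engine(
    f"databricks+connector://token:{token}@{host}:443/default",
    connect_args={
        "http_path": http_path,
        "catalog": "my_catalog",  # new keyword per the 2.0 changelog above
        "schema": "my_schema",
    },
)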

@susodapop
Contributor

susodapop commented Mar 17, 2022

Yes, I believe so. I'm developing the same behaviour in an unrelated context using the v2 beta and it works like you describe. Prior to the v2 release of databricks-sql-connector I've been automatically prepending USE {schema} to every call, which is inelegant but has done the trick.
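For reference, that prepend pattern on a raw databricks-sql-connector connection looks roughly like this (a sketch; host, http_path, token and the schema name are placeholders):

from databricks import sql

connection = sql.connect(server_hostname=host, http_path=http_path, access_token=token)

def run_query(query, schema="my_schema"):
    cursor = connection.cursor()
    try:
        # inelegant but effective: select the schema before every statement
        cursor.execute(f"USE {schema}")
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        cursor.close()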

@ebarault

ebarault commented Apr 1, 2022

Just figured this out before finding this thread. I'm also doing USE {schema}.

And for Unity Catalog, I'm doing USE CATALOG xxx.

Note: if you're not using plain SQL but using SQLAlchemy helpers, you can also pass the schema name as a named argument. (EDIT: which @crflynn already referred to)

@ebarault

ebarault commented Apr 7, 2022

@susodapop @crflynn hi,
pip install databricks-sql-connector==2.0.0b1

In this version it:

  • honors the value passed in for database/schema 👍
  • honors the new 3-level namespace used in Unity Catalog 👍
  • allows specifying the catalog 👎

According to Databricks, they are planning to open-source the databricks-sql-connector lib.

@ebarault

ebarault commented Apr 27, 2022

@susodapop @crflynn I don't know if you noticed, but databricks-sql-connector 2.0.1 is out.
