
[AIRFLOW-6296] add ODBC hook & deprecation warning for pymssql #6850

Merged
merged 1 commit on Jan 19, 2020

Conversation

dstandish
Contributor

@dstandish dstandish commented Dec 19, 2019

Changes

  • add OdbcHook using pyodbc
  • mssql odbc support installable with apache-airflow[odbc] (requires pyodbc)
  • existing pymssql-based hook is left unchanged, but a deprecation warning is added
  • existing MsSqlOperator is updated to use either hook depending on conn_type.
    • this relies on Connection.get_hook, which can return a different hook based on conn type.
    • if conn_type='odbc' then MsSqlOdbcHook will be used; otherwise, the MsSqlHook is used.
    • this saves us from having to create another mssql operator, and from having to add a hook_type parameter
  • added docs and tests as appropriate
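
The conn_type-based dispatch described in the bullets above can be sketched roughly like this (hypothetical names and a simplified mapping, not Airflow's actual internals):

```python
# Minimal sketch of dispatching to a hook class based on conn_type,
# mirroring the idea behind Connection.get_hook. The classes and the
# mapping are illustrative stand-ins, not the real Airflow implementations.
class MsSqlHook:
    """Stand-in for the existing pymssql-based hook."""

class OdbcHook:
    """Stand-in for the new pyodbc-based hook."""

_HOOKS_BY_CONN_TYPE = {
    "mssql": MsSqlHook,
    "odbc": OdbcHook,
}

def get_hook(conn_type: str):
    # The conn_type alone decides the hook class, so a single operator
    # can serve both client libraries without a hook_type parameter.
    return _HOOKS_BY_CONN_TYPE.get(conn_type, MsSqlHook)()
```

This is why no second mssql operator is needed: the operator asks the connection for its hook, and the connection's type decides.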

Jira

Description

  • [x]

see above

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in UPDATING.md so we can assign it to an appropriate release

@codecov-io

codecov-io commented Dec 19, 2019

Codecov Report

Merging #6850 into master will decrease coverage by 0.14%.
The diff coverage is 92.43%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6850      +/-   ##
==========================================
- Coverage   85.41%   85.27%   -0.15%     
==========================================
  Files         753      711      -42     
  Lines       39685    39503     -182     
==========================================
- Hits        33898    33687     -211     
- Misses       5787     5816      +29
Impacted Files Coverage Δ
airflow/sensors/sql_sensor.py 100% <ø> (ø) ⬆️
airflow/models/connection.py 77.4% <100%> (+8.62%) ⬆️
airflow/hooks/mssql_hook.py 76.19% <100%> (+17.36%) ⬆️
airflow/utils/helpers.py 82.5% <100%> (+0.8%) ⬆️
airflow/utils/log/json_formatter.py 100% <100%> (ø) ⬆️
airflow/operators/mssql_operator.py 90.62% <85.71%> (+90.62%) ⬆️
airflow/providers/odbc/hooks/odbc.py 92.22% <92.22%> (ø)
airflow/kubernetes/volume_mount.py 44.44% <0%> (-55.56%) ⬇️
airflow/kubernetes/volume.py 52.94% <0%> (-47.06%) ⬇️
airflow/kubernetes/pod_launcher.py 45.25% <0%> (-46.72%) ⬇️
... and 160 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5abce47...2fed850. Read the comment docs.

Member

@zhongjiajie zhongjiajie left a comment

My 2 cents: BTW, should we also update UPDATING.md to record this change?

setup.py (outdated review comments, resolved)
airflow/hooks/mssql_odbc_hook.py (outdated review comments, resolved)
airflow/providers/mssql/hooks/mssql_odbc.py (outdated review comments, resolved)
airflow/providers/mssql/hooks/mssql_odbc.py (outdated review comments, resolved)
@baolsen
Contributor

baolsen commented Jan 2, 2020

(Per Slack reply)
I really feel that SQLAlchemy might be useful here. Using different dialects allows us to be backwards compatible and let the user decide whether to use pymssql or pyodbc (with the extra effort that it requires).
Or we could easily do dialect = pymssql (for Airflow 1.10.7) and pyodbc for Airflow 2

@dstandish
Contributor Author

dstandish commented Jan 2, 2020

@baolsen

I really feel that SQLAlchemy might be useful here. Using different dialects allows us to be backwards compatible and let the user decide whether to use pymssql or pyodbc (with the extra effort that it requires).

it's an interesting idea, but i think it actually increases complexity to go that way. then your hook logic, and the design of your connection object parsing, has to be able to handle arbitrary mssql libraries.
e.g. with pyodbc we can supply arbitrary odbc connection properties as key-values in conn.extra. what would you do with these in pymssql? you could do it, but i think it is cleaner and safer to keep things separate.
and where in the conn object would you put the dialect selector?
you'd still have to have an extra option for odbc support at install to enable pyodbc.
and while this could potentially be used for switching between pyodbc and pymssql, it could not be used for turbodbc or bcp.

and i think that might be why conventionally in airflow it's the other way around: hooks are defined in accordance with the idiosyncrasies of their connectors, and they can optionally generate a sqlalchemy connection. and this PR does provide support for that here: (https://github.com/apache/airflow/blob/1b3b907ed6e8a45affa269b624dfe420d74424ed/airflow/providers/mssql/hooks/mssql_odbc.py#L156)
(the get_uri method is used by the dbapi hook to produce the sqlalchemy engine; i also provide a get_sqlalchemy_connection method)

@dstandish dstandish force-pushed the feature/mssql-odbc-hook branch 3 times, most recently from c7f1350 to 806b332 Compare January 4, 2020 22:41
@vamega

vamega commented Jan 7, 2020

I'm not sure if this is the right place to ask this, but is there a reason this isn't generalised to an odbc, or perhaps more specifically a pyodbc, hook? That way other databases that provide an odbc driver could also use this hook.

I'm specifically thinking of using this with the Exasol database, and having to create another hook that duplicates a lot of this code seems wasteful.

@dstandish
Contributor Author

@vamega I love this idea -- i thought about doing this initially but wasn't sure if there was a need for it.

i am happy to refactor this as an odbc hook.

i wonder if we should still make a mssql odbc hook that is a subclass....

please let me know if you have any ideas, or if you think that there are changes that would be helpful for exasol

@vamega

vamega commented Jan 7, 2020

@dstandish I'll take a look at the code today or tomorrow.

One of the things I'd like to support is directly passing in the ODBC connection string. That removes the need for an odbc.ini file to be provisioned on the machine to specify the driver, as long as a path to the ODBC driver .so file can be specified.

I'm not sure how well known this technique is, but you can see a demonstration of it here:

@dstandish
Contributor Author

dstandish commented Jan 7, 2020

One of the things I'd like to support is to directly pass in the ODBC connection string

@vamega the present design supports this :)

You can get some details from the doc.

Looking at the tests also demonstrates expected behavior.

I don't like having to mess with odbc.ini either. I support usage of a DSN but don't require it.

The basic approach though is this:

  • user / password / host / database / port are obtained from standard Connection attributes
  • any odbc params supplied in Connection.extra will be added to the connection string
  • kwargs meant for pyodbc.connect can be supplied under key connect_kwargs in Connection.extra
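
The string assembly described in the bullets above can be sketched like this (a hedged sketch; the function name, parameter names, and formatting details are illustrative, not the hook's actual signature):

```python
# Hypothetical sketch of building an ODBC connection string from the
# standard Connection attributes plus arbitrary extras, as described above.
def build_odbc_connection_string(host, login, password, schema,
                                 port=None, extra=None):
    # host/login/password/schema/port map to the standard Connection
    # attributes; `extra` carries arbitrary ODBC key-value properties
    # (e.g. Driver, ApplicationIntent) from Connection.extra.
    parts = {"SERVER": host, "DATABASE": schema, "UID": login, "PWD": password}
    if port:
        # SQL Server ODBC drivers take the port as "host,port"
        parts["SERVER"] = f"{host},{port}"
    parts.update(extra or {})
    return ";".join(f"{k}={v}" for k, v in parts.items())
```

Anything meant for pyodbc.connect itself (rather than the connection string) would then be passed separately, per the connect_kwargs convention above.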

@dstandish
Contributor Author

dstandish commented Jan 7, 2020

also @vamega ... re your comment

I'm not sure how well known this technique is, but you can see a demonstration of it here

incidentally, i am already using this approach in the get_uri method. note, though, that this just pertains to the sqlalchemy connection string, and is only used in the get_sqlalchemy_engine method inherited from DbApiHook.

the get_conn method does not rely on sqlalchemy, but uses the same property odbc_connection_string, where the odbc string is constructed.
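
The get_uri technique mentioned above (passing a raw ODBC connection string through to SQLAlchemy) can be sketched like this; the function name is illustrative, and the pass-through uses pyodbc's documented odbc_connect query parameter:

```python
from urllib.parse import quote_plus

# Sketch: hand a complete ODBC connection string to SQLAlchemy's
# mssql+pyodbc dialect via the odbc_connect parameter, bypassing
# host/port URL parsing entirely.
def sqlalchemy_uri(odbc_connection_string: str) -> str:
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_connection_string)
```

This is what lets the same odbc_connection_string property serve both get_conn (raw pyodbc) and the SQLAlchemy engine path.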

@dstandish dstandish changed the title [AIRFLOW-6296] add mssql odbc hook [AIRFLOW-6296] add pyodbc-based odbc hooks OdbcHook and MsSqlOdbcHook Jan 8, 2020
@dstandish dstandish changed the title [AIRFLOW-6296] add pyodbc-based odbc hooks OdbcHook and MsSqlOdbcHook [AIRFLOW-6296] add odbc hooks OdbcHook and MsSqlOdbcHook Jan 8, 2020
@dstandish dstandish force-pushed the feature/mssql-odbc-hook branch 3 times, most recently from a7adbeb to f5305f2 Compare January 9, 2020 05:39
@dstandish
Contributor Author

dstandish commented Jan 9, 2020

Alright... after further thought... i think it makes more sense to just call it odbc hook. there is no point in having a separate MsSqlOdbcHook that is identical apart from the name.

Having two identical hooks would be confusing for users and makes it difficult to create clear and consistent documentation.

@match-gabeflores
Contributor

match-gabeflores commented Apr 13, 2020

Did this get pushed to 2.0? At one point it was slated for 1.10.10, judging from the Jira history. Was it pushed because of potential impact?

https://issues.apache.org/jira/browse/AIRFLOW-6296

@potiuk
Member

potiuk commented Apr 14, 2020

Looking at the history, it was problematic to backport to 1.10.10 and @kaxil decided to leave it in 2.0. While trying to cherry-pick, we sometimes find that a change is too risky/too problematic and such a candidate gets dropped from the list.

Do you think @gflores1023 it is needed/badly needed for 1.10 line? What's the reasoning? Is there no suitable workaround until 2.0 is available?

Maybe - if we hear that it is really important we might resume and make some extra effort to cherry-pick to 1.10.11? Or maybe - if it is really important but only for you - you could try to make a PR against v1-10-tests that you could work on and we could review?

@kaxil
Member

kaxil commented Apr 14, 2020

Yes, this was indeed pushed back to 2.0. Like @potiuk mentioned, is there a specific need for this operator that the current operators in 1.10.10 can't fulfill?

@dstandish
Contributor Author

dstandish commented Apr 15, 2020

one reason i thought it would be nice to cherry pick this is so that users could switch to odbc (from pymssql) ahead of, and independent of, the 2.0 upgrade.

i suspect that the reason you guys did not want to cherry-pick is that it introduces deprecation of the pymssql-based MsSqlHook, and therefore, without intervention, implicitly deprecates MSSQLToGCSOperator and MsSqlToHiveTransfer?

if so, @gflores1023, this is another way you could potentially contribute and help move this along. the only thing that needs to be sorted out is the type map from the pyodbc return types to gcs / hive. then these operators can be adapted to work with the odbc hook too, in the same manner as was done with MsSqlOperator.

but i dunno, maybe there was also some unrelated ambivalence about deprecating pymssql support, or other unrelated concerns with the hook?

@dstandish
Contributor Author

dstandish commented Apr 24, 2020

@potiuk @kaxil another potential reason to cherry pick this one is that it seems pymssql cannot be installed on python 3.8... this PR provides a path to install airflow with MS SQL Server support on python 3.8 (now the default version with ubuntu 20.04 LTS), without having to install pymssql

here is the relevant pymssql issue.

@potiuk
Member

potiuk commented Apr 24, 2020

Yes, why not. I will mark it as 1.10.11. I also need to finish up the CI changes before thinking about adding 3.8 -- we already exceed DockerHub capacity to start running 3.8 tests, but after a few more changes I am working on (and moving the Kubernetes tests to GitHub Actions) we might want to think about adding 3.8 support officially (and maybe even backporting it to 1.10.* -- but I would rather not do that for 1.10, to add more incentive to move to 2.0 ;).

@potiuk potiuk added this to the Airflow 1.10.11 milestone Apr 24, 2020
@dstandish
Contributor Author

ah, the mythical 2.0 ;)

@kaxil
Member

kaxil commented Apr 24, 2020

Yeah, we can try to make Airflow 1.10.* Py 3.8 compatible, but without any guarantees, as the test suite is already very large for 1.10.*. We still support Python 2 :(

@ashb ashb mentioned this pull request May 18, 2020
@ldacey
Contributor

ldacey commented Aug 19, 2020

Is it possible to use turbodbc with this as well, or would that have to be a separate custom hook?

@dstandish
Contributor Author

dstandish commented Aug 19, 2020

Is it possible to use turbodbc with this as well, or would that have to be a separate custom hook?

Support could be added easily because the odbc conn str can be reused.

Internally what I did was add this method so you can optionally choose the turbodbc client:

    def get_turbodbc_connection(self, turbodbc_options: Optional[dict] = None):
        # imported locally so turbodbc can be optional
        import turbodbc
        from turbodbc import make_options
        from turbodbc_intern import Megabytes

        default_options_kwargs = dict(
            read_buffer_size=Megabytes(300),
            large_decimals_as_64_bit_types=True,
            prefer_unicode=True,
            autocommit=True,
        )
        turbodbc_options = make_options(**{**default_options_kwargs, **(turbodbc_options or {})})
        return turbodbc.connect(connection_string=self.odbc_connection_string, turbodbc_options=turbodbc_options)

@ldacey
Contributor

ldacey commented Aug 19, 2020

Thanks for the example! Perfect, let me try that as well.

@ldacey
Contributor

ldacey commented Aug 20, 2020

Worked like a charm @dstandish, I only had to change self.conn_str to self.odbc_connection_string.

Really nice that I can fetch the data as a pyarrow table with this turbodbc module (normally I just dump the files to a csv file, or read the data with pandas, convert it into a table, and save the data as a partitioned ParquetDataset -- this basically lets me write the dataset immediately)

@dstandish
Contributor Author

nice :)

glad it worked out for you

@ldacey
Contributor

ldacey commented Jan 25, 2021

Any chance of having the turbodbc function added into the official odbc hook? I have a custom version now, but that is the only change I made.

Also, has anyone had any luck getting ODBC to work with MySQL drivers? I installed the drivers and everything exists as expected, but when I try to run anything I get a file-not-found error:

message: [unixODBC][Driver Manager]Can't open lib '/usr/lib/x86_64-linux-gnu/odbc/libmyodbc8w.so' : file not found

https://dev.mysql.com/doc/connector-odbc/en/connector-odbc-installation-binary-deb.html

@potiuk
Member

potiuk commented Jan 25, 2021

@ldacey -> maybe you would like to contribute the turbodbc stuff. Happy to review it!

Re: the libmyodbc error -- how did you install it? And do you have the file it complains about?

@ldacey
Contributor

ldacey commented Jan 26, 2021

Hi - yes, the file existed, and it seems like the most likely culprit was some dependency issue. However, I stopped trying to install from the DEB source and instead just downloaded the tar files, which worked!

This is my block of code which handles the ODBC stuff for MS SQL and MySQL:

&& curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
&& apt-get update && apt-get -y dist-upgrade \
&& curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list > /etc/apt/sources.list.d/mssql-release.list \
&& apt-get update \
&& ACCEPT_EULA=Y apt-get install -y \
   msodbcsql17 \
   mssql-tools \
&& curl -SL https://dev.mysql.com/get/downloads/mysql-connector-odbc-8.0.23-linux-glibc2.12-x86-64bit.tar.gz | tar -zxC /opt \
&& cp /opt/mysql-connector-odbc-8.0.23-linux-glibc2.12-x86-64bit/lib/libmyodbc8* /usr/lib/x86_64-linux-gnu/odbc/ \
&& /opt/mysql-connector-odbc-8.0.23-linux-glibc2.12-x86-64bit/bin/myodbc-installer -d -a -n "MySQL ODBC 8.0 ANSI Driver" -t "DRIVER=/usr/lib/x86_64-linux-gnu/odbc/libmyodbc8a.so;" \
&& /opt/mysql-connector-odbc-8.0.23-linux-glibc2.12-x86-64bit/bin/myodbc-installer -d -a -n "MySQL ODBC 8.0 Unicode Driver" -t "DRIVER=/usr/lib/x86_64-linux-gnu/odbc/libmyodbc8w.so;" \

A MySQL conn_id example of the "extra" field:

{"Driver": "MySQL ODBC 8.0 Unicode Driver", "Threading": 2, "INITSTMT": "set session time_zone ='+00:00'"}

A MS SQL conn_id example:

{"Driver": "ODBC Driver 17 for SQL Server", "ApplicationIntent": "ReadOnly", "TrustedConnection": "Yes"}

I am using this 100% with the get_turbodbc_connection function that @dstandish provided and have not run into any problems yet. I could contribute this but I would feel like a fraud -- dstandish provided the function and I just copied and pasted it without making any other changes.

@dstandish
Contributor Author

I could contribute this but I would feel like a fraud - dstandish provided the function and I just copy and pasted it without making any other changes.

@ldacey do it! 🙂

@dstandish
Contributor Author

dstandish commented Jan 27, 2021

@ldacey let me add...

Just shepherding something through the merge process is meaningful and valuable contribution in itself and this method would probably be a helpful addition.

But there also remains meaningful creative work to be done.

For example, we ought to think through whether and how to make it so MsSqlOperator can optionally use turbodbc. Right now, it will use the odbc hook if the conn type is odbc, and the original MsSqlHook otherwise. (That is, if the conn uri is odbc://blah... MsSqlOperator will use OdbcHook.) You could possibly make it so that if the conn type is turbodbc then it uses a turbodbc connection.

Or perhaps you might decide it makes more sense to contribute a turbodbc hook, which could be a thin subclass of odbc hook, changing only get_conn.

Or, maybe we just add that single method and people can use it in their own operators, like you do now 🤷‍♂️

I never got around to thinking through these decisions, which is partly why I never made a PR for it.

Additionally there is the need to address documentation.

So, you can definitely contribute support for turbodbc without feeling in any way that you appropriated anything.
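
The "thin subclass" option mentioned above could look roughly like this (a hypothetical sketch: OdbcHook here is a simplified stand-in for airflow.providers.odbc.hooks.odbc.OdbcHook, reduced to the one property and method relevant to the idea):

```python
# Sketch of a turbodbc hook as a thin subclass of the odbc hook,
# reusing the ODBC connection string and overriding only get_conn.
class OdbcHook:
    """Simplified stand-in for the real OdbcHook."""
    # In the real hook this property is assembled from the Connection.
    odbc_connection_string = "DRIVER={...};SERVER=...;DATABASE=...;"

    def get_conn(self):
        import pyodbc  # imported lazily, as in the real hook
        return pyodbc.connect(self.odbc_connection_string)

class TurbodbcHook(OdbcHook):
    """Same connection string, different client library."""
    def get_conn(self):
        import turbodbc  # imported lazily so turbodbc stays optional
        return turbodbc.connect(connection_string=self.odbc_connection_string)
```

Everything inherited from the base (and, in the real hook, from DbApiHook) would keep working unchanged, since only the client library differs.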

@eladkal
Contributor

eladkal commented Apr 29, 2021

hi @dstandish, a question/thoughts on the current situation.
In order to use MsSqlOperator we are forced to also install the odbc provider even though we don't have any odbc connections. pyodbc causes us trouble in our CI due to mkleehammer/pyodbc#441

What is the motivation to bind the two providers? (since pymssql is maintained #11537 )

@dstandish
Contributor Author

dstandish commented Apr 29, 2021

@eladkal for background, the original thought was that we would deprecate support for pymssql and switch to pyodbc since the pymssql project, at the time, had been abandoned, and they pushed out an intentionally broken 3.0 release.

So I added an odbc hook, and using the method Connection.get_hook, updated MsSqlOperator so that it would use the hook you chose based on your connection's conn_type (allowing you to optionally start using odbc)

But later, I think the pymssql project was resurrected, and at some point airflow decided to un-deprecate pymssql as the client for mssql.

@dstandish
Contributor Author

OK @eladkal i think in #15594 i have successfully removed the interdependency while maintaining verification of compatibility in the mssql operator test.

Please take a look

Thanks

@ldacey
Contributor

ldacey commented May 4, 2021

I recently added support for Snowflake, so I thought I would share it in case anyone else is using turbodbc with Airflow. The nice part is that my data goes straight from the database to a partitioned parquet dataset without worrying about types changing (due to pandas or reading in CSV data). I can use a single "OdbcToAzureBlobOperator" for all of my SQL data sources.

Driver installation in my Dockerfile:

  && curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
    && apt-get update && apt-get -y dist-upgrade \
    && curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list > /etc/apt/sources.list.d/mssql-release.list \
    && apt-get update \
    && ACCEPT_EULA=Y apt-get install -y \
        msodbcsql17 \
        mssql-tools \
    && curl -SL https://dev.mysql.com/get/downloads/mysql-connector-odbc-8.0.23-linux-glibc2.12-x86-64bit.tar.gz | tar -zxC /opt \
    && cp /opt/mysql-connector-odbc-8.0.23-linux-glibc2.12-x86-64bit/lib/libmyodbc8* /usr/lib/x86_64-linux-gnu/odbc/ \
    && /opt/mysql-connector-odbc-8.0.23-linux-glibc2.12-x86-64bit/bin/myodbc-installer -d -a -n "MySQL ODBC 8.0 ANSI Driver" -t "DRIVER=/usr/lib/x86_64-linux-gnu/odbc/libmyodbc8a.so;" \
    && /opt/mysql-connector-odbc-8.0.23-linux-glibc2.12-x86-64bit/bin/myodbc-installer -d -a -n "MySQL ODBC 8.0 Unicode Driver" -t "DRIVER=/usr/lib/x86_64-linux-gnu/odbc/libmyodbc8w.so;" \
    && curl -SL https://sfc-repo.azure.snowflakecomputing.com/odbc/linux/2.23.1/snowflake_linux_x8664_odbc-2.23.1.tgz | tar -zxC /opt \
    && mv /opt/snowflake_odbc /usr/lib/x86_64-linux-gnu/odbc/ \
    && /usr/lib/x86_64-linux-gnu/odbc/snowflake_odbc/iodbc_setup.sh \

Sample of Extras from each database type:

Snowflake:

{"Driver": "SnowflakeDSIIDriver", "warehouse": "warehousename", "role": "rolename"}

MS SQL:

{"Driver": "ODBC Driver 17 for SQL Server", "ApplicationIntent": "ReadOnly", "TrustedConnection": "Yes"}

MySQL:

{"Driver": "MySQL ODBC 8.0 Unicode Driver", "Threading": 2, "INITSTMT": "set session time_zone ='+00:00'"}
