[AIRFLOW-6296] add ODBC hook & deprecation warning for pymssql #6850
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6850 +/- ##
==========================================
- Coverage 85.41% 85.27% -0.15%
==========================================
Files 753 711 -42
Lines 39685 39503 -182
==========================================
- Hits 33898 33687 -211
- Misses 5787 5816 +29
Continue to review full report at Codecov.
My 2 cents: BTW, should we also change UPDATE.md to record this change?
(Per Slack reply)
it's an interesting idea but i think it actually increases complexity to go that way. then your hook logic, and the design of your connection object parsing, have to be able to handle arbitrary mssql libraries. and i think that might be why conventionally in airflow it's the other way around -- hooks are defined in accordance with the idiosyncrasies of their connectors, and they can optionally generate a sqlalchemy connection. and this PR does provide support for that here: (https://github.com/apache/airflow/blob/1b3b907ed6e8a45affa269b624dfe420d74424ed/airflow/providers/mssql/hooks/mssql_odbc.py#L156)
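For reference, layering an optional SQLAlchemy connection on top of an ODBC hook usually comes down to URL-encoding the raw connection string into SQLAlchemy's `odbc_connect` query parameter. A minimal sketch (the `mssql+pyodbc` URI form is SQLAlchemy's documented pyodbc dialect; the function name here is illustrative, not the PR's actual code):

```python
from urllib.parse import quote_plus

def sqlalchemy_uri_from_odbc(conn_str: str) -> str:
    """Wrap a raw DSN-less ODBC connection string in a SQLAlchemy pyodbc URI."""
    # SQLAlchemy's pyodbc dialects accept a whole connection string
    # via the special ``odbc_connect`` query parameter.
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(conn_str)

uri = sqlalchemy_uri_from_odbc(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myhost;DATABASE=mydb;UID=me;PWD=secret"
)
```

The resulting URI can be handed straight to `sqlalchemy.create_engine`.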
I'm not sure if this is the right place to ask this, but is there a reason this isn't generalised to an ODBC hook? I'm specifically thinking of using this with the Exasol database, and having to create another hook that duplicates a lot of this code seems wasteful.
@vamega I love this idea -- i thought about doing this initially, but wasn't sure if there was a need for it. i am happy to refactor this as an odbc hook. i wonder if we should still make a mssql odbc hook that is a subclass... please let me know if you have any ideas, or if you think there are changes that would be helpful for exasol
@dstandish I'll take a look at the code today or tomorrow. One of the things I'd like to support is directly passing in the ODBC connection string. That removes the need for a DSN entry in odbc.ini. I'm not sure how well known this technique is, but you can see a demonstration of it here:
@vamega the present design supports this :) You can get some details from the doc. Looking at the tests also demonstrates expected behavior. I don't like having to mess with odbc.ini either. I support usage of a DSN but do not require it. The basic approach though is this:
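As an illustrative sketch of that basic approach -- assembling a DSN-less connection string from the connection's standard fields plus extras -- with the caveat that the names and signature here are assumptions for illustration, not the hook's actual API:

```python
def build_odbc_conn_str(host, login, password, schema=None, extra=None):
    """Assemble a DSN-less ODBC connection string from connection fields.

    ``extra`` is a dict of additional ODBC keywords (e.g. Driver, or DSN if
    you do want one), mirroring how Airflow connections carry driver info
    in the Extra field.
    """
    parts = {}
    if extra:
        parts.update(extra)  # e.g. {"Driver": "{ODBC Driver 17 for SQL Server}"}
    if host:
        parts["Server"] = host
    if schema:
        parts["Database"] = schema
    if login:
        parts["UID"] = login
    if password:
        parts["PWD"] = password
    # ODBC connection strings are semicolon-delimited Key=Value pairs
    return ";".join(f"{k}={v}" for k, v in parts.items())

conn_str = build_odbc_conn_str(
    host="myhost", login="me", password="secret", schema="mydb",
    extra={"Driver": "{ODBC Driver 17 for SQL Server}"},
)
```

The string can then be passed straight to `pyodbc.connect(conn_str)`, no odbc.ini required.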
also @vamega ... re your comment: incidentally, i am already using this approach in the
Alright... after further thought... i think it makes more sense to just call it odbc hook. there is no point in having a separate mssql odbc hook. Having two identical hooks would be confusing for users and would make it difficult to create clear and consistent documentation.
Did this get pushed to 2.0? At one point it was slated for 1.10.10, judging from the history of Jira. Was it pushed because of potential impact?
Looking at the history, it was problematic to import to 1.10.10 and @kaxil decided to leave it in 2.0. While trying to cherry-pick we sometimes find that it's too risky/too problematic, and such a candidate gets dropped from the list. Do you think @gflores1023 it is needed/badly needed for the 1.10 line? What's the reasoning? Is there no suitable workaround until 2.0 is available? Maybe - if we hear that it is really important, we might resume and make some extra effort to cherry-pick to 1.10.11? Or maybe - if it is really important but only for you - you could make a PR against v1-10-tests that you could work on and we could review?
Yes, this was indeed pushed back to 2.0. Like @potiuk mentioned, is there a specific need for this operator which the current operators in 1.10.10 can't fulfill?
one reason i thought it would be nice to cherry-pick this is so that users could switch to odbc (from pymssql) ahead of, and independent of, the 2.0 upgrade. i suspect that the reason you guys did not want to cherry-pick is because it introduces deprecation of the pymssql-based mssqlhook, and therefore, without intervention, implicitly deprecates MSSQLToGCSOperator and MsSqlToHiveTransfer? if so, @gflores1023, this is another way you could potentially contribute and help move this along. the only thing that needs to be sorted out is the type map --- from the pyodbc return types to gcs / hive. then these operators can be adapted to work with odbc hook also, in the same manner as was done with MsSqlOperator. but i dunno, maybe there was also some unrelated ambivalence about deprecating pymssql support, or other unrelated concerns with the hook?
Yes. Why not. I will mark it as 1.10.11 - and I need to finish up the CI changes to think about adding 3.8 - we already exceed capacity for DockerHub to start running 3.8 tests, but after a few more changes I am working on (and moving Kubernetes tests to Github Actions) we might want to think about adding 3.8 support officially (and maybe even backporting it to 1.10.* - but I would rather not do it for 1.10, to add more incentive to move to 2.0 ;).
ah, the mythical 2.0 ;)
Yeah, we can try to make Airflow 1.10.* Py 3.8 compatible, but without any guarantees, as the test suite is already very large for 1.10.*. We still support Python 2 :(
Is it possible to use turbodbc with this as well, or would that have to be a separate custom hook?
Support could be added easily because the odbc conn str can be reused. Internally what I did was add this method so you can optionally choose the turbodbc client:

    def get_turbodbc_connection(self, turbodbc_options: Optional[dict] = None):
        # imported locally so turbodbc can be optional
        import turbodbc
        from turbodbc import make_options
        from turbodbc_intern import Megabytes

        default_options_kwargs = dict(
            read_buffer_size=Megabytes(300),
            large_decimals_as_64_bit_types=True,
            prefer_unicode=True,
            autocommit=True,
        )
        turbodbc_options = make_options(**{**default_options_kwargs, **(turbodbc_options or {})})
        return turbodbc.connect(connection_string=self.odbc_connection_string, turbodbc_options=turbodbc_options)
Thanks for the example! Perfect, let me try that as well. |
Worked like a charm @dstandish, I only had to change self.conn_str to self.odbc_connection_string. Really nice that I can fetch the data as a pyarrow table with this turbodbc module (normally I dump the data to a CSV file, or read it with pandas, convert it into a table, and save it as a partitioned ParquetDataset -- this basically lets me write the dataset immediately).
nice :) glad it worked out for you
Any chance of having the turbodbc function added into the official odbc_hook? I have a custom version now, but that is the only change I made. Also, has anyone had any luck with getting ODBC to work with MySQL drivers? I installed the drivers and everything exists as expected, but when I try to run anything I get a file-not-found error:
https://dev.mysql.com/doc/connector-odbc/en/connector-odbc-installation-binary-deb.html |
@ldacey -> maybe you would like to contribute the turbodbc stuff? Happy to review it. Re: the libmyodbc error - how did you install it? And do you have the file it complains about?
Hi - yes, the file existed, and it seems like the most likely culprit was some dependency issues. However, I stopped trying to install from the DEB source and instead just downloaded the tar files, which worked! This is my block of code which handles the ODBC stuff for MS SQL and MySQL:
A MySQL conn_id example of the "extra" field:
A MS SQL conn_id example:
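For illustration, ODBC "extra" fields along these lines work for such connections -- hypothetical values, since the driver names must match whatever the worker actually has installed, and any keys beyond Driver are simply passed through into the connection string:

```python
import json

# Hypothetical "extra" payloads for an ODBC conn_id; the driver names are
# assumptions about what is installed, not values from this thread.
mysql_extra = {"Driver": "MySQL ODBC 8.0 Unicode Driver", "charset": "utf8mb4"}
mssql_extra = {"Driver": "ODBC Driver 17 for SQL Server", "TrustServerCertificate": "yes"}

# The Extra field on an Airflow connection stores this as a JSON string
print(json.dumps(mysql_extra))
```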
I am using this 100% with the
@ldacey do it! 🙂
@ldacey let me add... Just shepherding something through the merge process is a meaningful and valuable contribution in itself, and this method would probably be a helpful addition. But there also remains meaningful creative work to be done. For example, we ought to think through whether and how to make it so Or perhaps you might decide it makes more sense to contribute a turbodbc hook, which could be a thin subclass of odbc hook, changing only get_conn. Or maybe we just add that single method and people can use it in their own operators, like you do now 🤷♂️ I never got around to thinking through these decisions, which is partly why i never made a PR for it. Additionally there is the need to address documentation. So, you can definitely contribute support for turbodbc without feeling in any way that you appropriated anything.
hi @dstandish, question/thoughts on the current situation. What is the motivation to bind the two providers? (since pymssql is maintained #11537)
@eladkal for background, the original thought was that we would deprecate support for pymssql and switch to pyodbc, since the pymssql project, at the time, had been abandoned, and they pushed out an intentionally broken 3.0 release. So I added an odbc hook, and using the `Connection.get_hook` method, `MsSqlOperator` could pick the right hook based on conn type. But later, I think the pymssql project was resurrected, and at some point airflow decided to un-deprecate pymssql as the client for mssql.
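That `Connection.get_hook` dispatch amounts to a lookup from conn type to hook class. A rough sketch with stand-in classes (the real hooks live in the airflow providers packages):

```python
class MsSqlHook:
    """Stand-in for the pymssql-based hook."""

class OdbcHook:
    """Stand-in for the pyodbc-based hook."""

# Registry mapping a connection's conn_type to the hook class that serves it
_HOOKS = {"mssql": MsSqlHook, "odbc": OdbcHook}

def get_hook(conn_type: str):
    """Return the hook class registered for a connection type."""
    try:
        return _HOOKS[conn_type]
    except KeyError:
        raise ValueError(f"No hook registered for conn_type={conn_type!r}")
```

An operator can then stay client-agnostic: it asks the connection for its hook instead of hard-coding one.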
I recently added support for Snowflake, so I thought I would share it in case anyone else is using turbodbc and Airflow. The nice part about this is that my data goes straight from the database to a partitioned parquet dataset without worrying about types changing (due to pandas or reading in CSV data). I can use a single "OdbcToAzureBlobOperator" for all of my SQL data sources. Driver installation in my Dockerfile:
Sample of Extras from each database type: Snowflake:
MS SQL:
MySQL:
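The point of sharing one operator across Snowflake, MS SQL, and MySQL is that, per source, essentially only the Driver keyword in the connection string changes. A hedged sketch of that idea (the driver names are assumptions about what the driver manager has registered, not values from this thread):

```python
# Hypothetical driver names per source type; the real values must match what
# unixODBC has registered in odbcinst.ini on the worker.
DRIVERS = {
    "snowflake": "SnowflakeDSIIDriver",
    "mssql": "ODBC Driver 17 for SQL Server",
    "mysql": "MySQL ODBC 8.0 Unicode Driver",
}

def conn_str_for(source: str, host: str, database: str, user: str, pwd: str) -> str:
    """Build a DSN-less ODBC connection string, varying only the Driver keyword."""
    return (
        f"Driver={{{DRIVERS[source]}}};Server={host};"
        f"Database={database};UID={user};PWD={pwd}"
    )

s = conn_str_for("mysql", "db.example.com", "reporting", "me", "secret")
```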
Changes
- Adds the apache-airflow[odbc] extra (requires pyodbc).
- MsSqlOperator is updated to use either MsSql hook depending on conn_type, via Connection.get_hook, which can return a different hook based on conn type. If conn_type='odbc' then MsSqlOdbcHook will be used; otherwise, the MsSqlHook is used.
- Adds a hook_type parameter.
parameterJira
Description
see above
Tests
Commits
Documentation