Skip to content

Conversation

@maxispeicher
Copy link
Contributor

@maxispeicher maxispeicher commented Jan 1, 2021

Issue #356:

Description of changes:

Adding support for Microsoft SQL Server which should be equivalent to the PostgreSQL or MySQL support. For handling the connection to SQL Server databases, the pymssql package is used. Apart from the pure feature implementation some adaptions were needed:

  • Parsing of the JDBC URL in awswrangler/_databases.py needs to be adapted to the SQL Server Syntax for the database name
  • read_sql_query in awswrangler/_databases.py needed to be adapted because the pymssql.Cursor object could not be handed over to a different function. Because of that the cursor object is first created in the functions which directly query the database.
  • Decoding the databases passwords in awswrangler/catalog/_get.py to a str from bytes because pymssql.connect() can't handle a byte string as password
  • In tests change the name of the double column to ddouble, because double is protected in SQL Server.

Additionally the raise keyword was added for some exceptions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@igorborgest igorborgest added this to the 2.3.0 milestone Jan 2, 2021
@igorborgest igorborgest self-requested a review January 2, 2021 12:39
Copy link
Contributor

@igorborgest igorborgest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @maxispeicher !

I just want to highlight some issues with the new dependency on pymssql.

  • It had started a deprecation process last year that was reverted later. Even though, the main developer leaved the project and honestly I don't know how much active the current developer(s) are.
  • It's under the LGPL license that has been facing some barriers to be used internally in some companies.
  • Microsoft explicit endorse and recommend the use of pyodbc (Reference):

    There are several python SQL drivers available. However, Microsoft places its testing efforts and its confidence in pyodbc driver.

So, I didn't dig to deep in this universe yet but seems that pyodbc would be a better candidate. But we would still need to figure out how to fit it inside the library because it has a external dependency on the ODBC driver that must be installed independently and the Lambda Layer / AWS Glue whl will not support it out-of-the-box.

But first let's try to evaluate pyodbc as a possible dependency and then we invest some energy trying to understand the best way to distribute it. (Probably just a good documentation about the limitation and some reference about how to install the driver).

What do you think?

@danielwo FYI

@igorborgest
Copy link
Contributor

We should maybe add pyodbc as an optional dependency...

@maxispeicher
Copy link
Contributor Author

I also was unsure about whether to use pymssql or pyodbc. I've decided against pyodbc because of the additional dependency on the ODBC driver, but I also see the issues with pymssql.
Maybe we could add a extra to the package for using sqlserver which installs pyodbc

@igorborgest
Copy link
Contributor

igorborgest commented Jan 2, 2021

Yeah, it sounds like the best plan by now. Pyodbc as an extra dependency will keep the same requirements for current users that don't need MS SQLServer. And for those that want to use it, they already need to install the driver by themselves, an extra dependency during the pip install would be nothing.

Let's wait for the @danielwo opinion too, he has more experience running on restrict environments than me.

@maxispeicher
Copy link
Contributor Author

Sounds good to me. I will do some local testing meanwhile, to see if everything works as intended when using pyodbc instead of pymssql.

@igorborgest igorborgest requested a review from danielwo January 2, 2021 15:19
@maxispeicher
Copy link
Contributor Author

So pyodbc seems to work just fine aswell (if you have the driver installed). I've implemented the swap on a new branch, if you want to have a look: https://github.com/maxispeicher/aws-data-wrangler/tree/swap-to-pyodbc.
However it's still missing documentation (and checking if there are remainders from pymssql)

@igorborgest
Copy link
Contributor

This new branch seems great. I only see one issue:

We will probably need to import pyodbc dynamically or figure out some other strategy to only import it when it is installed. Otherwise, the way it is implemented right now, Wrangler will always try to import it during the import awswrangler.

@igorborgest
Copy link
Contributor

Previously, AWS Data Wrangler had support for Apache Spark, and this was the way we were handling this kind of situation.

P.S. Regardless to a specific solution we must ensure that the autocomplete will work properly for wr.sqlserver.*.

@maxispeicher
Copy link
Contributor Author

I've tried to implement a solution which basically always imports the sqlserver module, but only imports pyodbc when it is available. As soon as a public function from the sqlserver module is called and pyodbc is not installed a ModuleNotFoundError is raised with a hint that pyodbc or awswrangler with the sqlserver extra needs to be installed.

It's also pushed to the swap-to-pyodbc branch.

@igorborgest
Copy link
Contributor

Really cool. I've liked this approach!

@maxispeicher
Copy link
Contributor Author

For now I've managed to adapt the build process of the Lambda Layers to include the needed ODBC driver and pyodbc. However I couldn't find a way to do it for the Glue part.

@maxispeicher maxispeicher marked this pull request as draft January 4, 2021 16:18
@danielwo danielwo marked this pull request as ready for review January 4, 2021 18:14
@danielwo danielwo marked this pull request as draft January 4, 2021 18:17
@igorborgest
Copy link
Contributor

@maxispeicher the lambda layer is technically perfect, but we will not be able to distribute this layers containing the MS proprietary driver. The best we will be able to do here is to give the instructions (docs) to let users create the layer by themselves.

And with this scenario in mind I would prefer to instruct them how to create a smaller "sidecar" layer that will only contain the driver and the config files. This way they will be able to use our pre-built layer + their own layer with the missing driver.

But regardless the way we will help users to use this driver on lambda/glue, let's first merge a PR without it. The pure implementation with pyodbc already has a lot of value for users running in others platforms. Then you could open a second PR only to implement/discuss the driver distribution subject.

What do you think?

@maxispeicher
Copy link
Contributor Author

Yes sounds good to me. That's what I already feared regarding the ODBC driver. I've tried to add a little bit of documentation regarding Microsoft SQL Server and the required setup in the Install section of the docs. Hopefully it helps a little bit.

Additionally I've reverted any changes to the Lambda Layer build process.

I think from my side this PR is good to go in this state.

@maxispeicher maxispeicher marked this pull request as ready for review January 4, 2021 22:01
Copy link
Contributor

@danielwo danielwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. I like the dynamic import + decorator.

Copy link
Contributor

@igorborgest igorborgest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some tiny observations.

@igorborgest igorborgest self-assigned this Jan 5, 2021
Copy link
Contributor

@igorborgest igorborgest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @maxispeicher !

@igorborgest igorborgest merged commit 6172b4c into aws:master Jan 5, 2021
@gtossou
Copy link

gtossou commented Jan 5, 2021

Thanks for working on the feature guys

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants