
Create SQL Alchemy connection string using Okta IdP #67

Closed
namtonthat opened this issue Nov 11, 2021 · 13 comments

Comments

@namtonthat

Driver version

redshift_connector 2.0.889

Redshift version

PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.31651

Client Operating System

macOS Monterey

Python version

3.9.7

Problem description

import redshift_connector

# An Okta auth page will pop up when initiating this connector
conn = redshift_connector.connect(
    host='<cluster_identifier>.ap-southeast-2.redshift.amazonaws.com',
    port=5439,
    region='ap-southeast-2',
    database='<db_name>',
    credentials_provider='BrowserSamlCredentialsProvider',
    login_url='<saml_url>',
    cluster_identifier='<cluster_identifier>',
    preferred_role='arn:aws:iam::<iam-id>:role/<role-name>',
    user='',        # left empty string
    password='',    # left empty string
    iam=True,
    ssl=True
)

cursor: redshift_connector.Cursor = conn.cursor()

Currently, I can connect to Redshift using the above credentials, but when trying to use the SQLAlchemy connection string it causes an identification error (using sqlalchemy-redshift). Is there a way to generate an appropriate URL for SQLAlchemy to parse for IdP authentication?

I've noticed that redshift_connector was recently updated to work with sqlalchemy-redshift, but I've had no luck generating an appropriate login URL.

@Brooke-white
Contributor

Hey @nam-tonthat-afterpay ,

Below is an example showing how to establish a connection to Redshift using Okta Browser authentication. Please note that the user and password fields are no longer required (i.e., since you're using the browser authentication plugin, they can be excluded).

import sqlalchemy as sa
from sqlalchemy.engine.url import URL

url = URL.create(
    drivername='redshift+redshift_connector',  # indicate redshift_connector driver and dialect will be used
)

conn_params = {
    "ssl": True,
    "iam": True,  # must be enabled when authenticating via IAM
    "credentials_provider": "BrowserSamlCredentialsProvider",
    "database": '<db_name>',  # Amazon Redshift database
    "region": "ap-southeast-2",
    "cluster_identifier": "<cluster_identifier>",
    "login_url": "<saml_url>"
}

engine = sa.create_engine(url, connect_args=conn_params)

with engine.connect() as connection:
    result = connection.execute("select 1")
    for row in result:
        print(row)

engine.dispose()

Please let me know if you see any issue or have any questions.

@namtonthat
Author

namtonthat commented Nov 12, 2021

Hey @Brooke-white,

I appreciate that! It works when run by itself, but once it's integrated into Amundsen, where botocore/client.py makes the API call, it returns this error.

File "/Users/nam.tonthat/miniconda/envs/amundsen/lib/python3.9/site-packages/redshift_connector/iam_helper.py", line 371, in set_cluster_credentials
    response = client.describe_clusters(ClusterIdentifier=info.cluster_identifier)
  File "/Users/nam.tonthat/miniconda/envs/amundsen/lib/python3.9/site-packages/botocore/client.py", line 386, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/nam.tonthat/miniconda/envs/amundsen/lib/python3.9/site-packages/botocore/client.py", line 705, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ClusterNotFoundFault: An error occurred (ClusterNotFound) when calling the DescribeClusters operation: 

This uses the sql_alchemy_extractor.py found here.

More context, I've used the following versions:

  • amundsen-common = 0.21.0
  • amundsen-databuilder = 6.3.1
  • amundsen-rds = 0.0.6
  • sqlalchemy = 1.4.22
  • sqlalchemy-redshift = 0.8.8
  • botocore = 1.23.2

Any idea on how I should start debugging this issue?

@Brooke-white
Contributor

Hi @nam-tonthat-afterpay ,

Thanks for providing this additional information. I'd recommend trying the following:

  • Enable debug logging for redshift_connector and verify nothing looks suspicious in the logs (e.g. region, cluster_identifier)
  • Are the cluster_identifier and region used for establishing the connection in line with what is seen in the debug logs? As a sanity check, does the cluster exist in the given region?

The error provided is being thrown after the authentication process with Okta has completed (i.e., we have temporary IAM credentials), when redshift_connector tries to retrieve the host and port of the cluster with the given cluster_identifier. It's a good sign that we get far enough to check whether the cluster named by cluster_identifier exists.

In my experience, this exception occurs when authentication has been successful but there is some mismatch between cluster_identifier and region.

If you still see issues after trying the above, please let me know and I will work to reproduce on my end.

@namtonthat
Author

Hey @Brooke-white, do you have any tutorials on how to enable debug logging for redshift_connector? I'm still newish to logging.

I've tried using the Python tutorial but haven't had much luck.

@Brooke-white
Contributor

Hey @nam-tonthat-afterpay ,

Apologies for the late response. Here's an example of configuring debug logging to output to stdout.

This is likely easier than outputting the logs to a file, given your Amundsen instance is probably running in a Docker container: a log file would end up there rather than on your local file system :)

import redshift_connector
import logging
import sys

root = logging.getLogger()
root.setLevel(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)  # will write all logs with a level >= DEBUG
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)

with redshift_connector.connect(
...
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("select 1")
        print(cursor.fetchall())

For me, this outputs:

2021-11-19 09:31:02,869 - redshift_connector - DEBUG - ===================================
2021-11-19 09:31:02,869 - redshift_connector - DEBUG - User provided connection arguments
2021-11-19 09:31:02,869 - redshift_connector - DEBUG - ===================================
2021-11-19 09:31:02,869 - redshift_connector - DEBUG - {'access_key_id': None, 'allow_db_user_override': False, 'app_id': None, 'app_name': 'amazon_aws_redshift',....}

ref: https://stackoverflow.com/a/14058475
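As a side note (not from the thread, just a sketch): if the root-logger approach above pulls in too much DEBUG noise from botocore and urllib3, the handler can instead be attached to the `redshift_connector` logger alone. A minimal stdlib example, using an in-memory stream as a stand-in for stdout:

```python
import io
import logging

# Scope the handler to the "redshift_connector" logger only, so DEBUG
# output from other libraries on the root logger is not captured.
logger = logging.getLogger("redshift_connector")
logger.setLevel(logging.DEBUG)

stream = io.StringIO()  # stand-in for sys.stdout
handler = logging.StreamHandler(stream)
handler.setLevel(logging.DEBUG)
handler.setFormatter(logging.Formatter('%(name)s - %(levelname)s - %(message)s'))
logger.addHandler(handler)

logger.debug("User provided connection arguments")
print("redshift_connector - DEBUG - User provided connection arguments" in stream.getvalue())  # True
```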

@namtonthat
Author

namtonthat commented Nov 26, 2021

Hey @Brooke-white, thanks for the heads up; I've done some more testing and the cluster_identifier and region look correct, but when the query starts it returns the ClusterNotFound error.

2021-11-26 15:16:40,716 - botocore.endpoint - DEBUG - Making request for OperationModel(name=DescribeClusters) with params: {'url_path': '/', 'query_string': '', 'method': 'POST', 'headers': {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8', 'User-Agent': 'Boto3/1.18.21 Python/3.9.7 Darwin/21.1.0 Botocore/1.21.41'}, 'body': {'Action': 'DescribeClusters', 'Version': '2012-12-01', 'ClusterIdentifier': 'dwh'}, 'url': 'https://redshift.ap-southeast-2.amazonaws.com/', 'context': {'client_region': 'ap-southeast-2', 'client_config': <botocore.config.Config object at 0x7fa1685b5040>, 'has_streaming_input': False, 'auth_type': None}}
2021-11-26 15:16:40,716 - botocore.hooks - DEBUG - Event request-created.redshift.DescribeClusters: calling handler <bound method RequestSigner.handler of <botocore.signers.RequestSigner object at 0x7fa1685aff70>>
2021-11-26 15:16:40,716 - botocore.hooks - DEBUG - Event choose-signer.redshift.DescribeClusters: calling handler <function set_operation_specific_signer at 0x7fa168342b80>
2021-11-26 15:16:40,717 - botocore.auth - DEBUG - Calculating signature using v4 auth.
2021-11-26 15:16:40,717 - botocore.auth - DEBUG - CanonicalRequest:
POST
/
2021-11-26 15:16:41,091 - urllib3.connectionpool - DEBUG - https://redshift.ap-southeast-2.amazonaws.com:443 "POST / HTTP/1.1" 404 287
2021-11-26 15:16:41,092 - botocore.parsers - DEBUG - Response headers: {'x-amzn-RequestId': '934b5778-7693-4be6-ae70-7efdbdcf975b', 'Content-Type': 'text/xml', 'Content-Length': '287', 'Date': 'Fri, 26 Nov 2021 04:16:40 GMT'}
2021-11-26 15:16:41,092 - botocore.parsers - DEBUG - Response body:
b'<ErrorResponse xmlns="http://redshift.amazonaws.com/doc/2012-12-01/">\n  <Error>\n    <Type>Sender</Type>\n    <Code>ClusterNotFound</Code>\n    <Message>Cluster dwh not found.</Message>\n  </Error>\n  <RequestId>934b5778-7693-4be6-ae70-7efdbdcf975b</RequestId>\n</ErrorResponse>\n'
2021-11-26 15:16:41,094 - botocore.parsers - DEBUG - Response headers: {'x-amzn-RequestId': '934b5778-7693-4be6-ae70-7efdbdcf975b', 'Content-Type': 'text/xml', 'Content-Length': '287', 'Date': 'Fri, 26 Nov 2021 04:16:40 GMT'}
2021-11-26 15:16:41,095 - botocore.parsers - DEBUG - Response body:
b'<ErrorResponse xmlns="http://redshift.amazonaws.com/doc/2012-12-01/">\n  <Error>\n    <Type>Sender</Type>\n    <Code>ClusterNotFound</Code>\n    <Message>Cluster dwh not found.</Message>\n  </Error>\n  <RequestId>934b5778-7693-4be6-ae70-7efdbdcf975b</RequestId>\n</ErrorResponse>\n'
2021-11-26 15:16:41,095 - botocore.hooks - DEBUG - Event needs-retry.redshift.DescribeClusters: calling handler <botocore.retryhandler.RetryHandler object at 0x7fa1685b5970>
2021-11-26 15:16:41,095 - botocore.retryhandler - DEBUG - No retry needed.
2021-11-26 15:16:41,096 - redshift_connector.iam_helper - ERROR - ClientError: An error occurred (ClusterNotFound) when calling the DescribeClusters operation: Cluster dwh not found.

Any ideas for this error?

@Brooke-white
Contributor

Hey @nam-tonthat-afterpay ,

Could you verify that the IAM SAML federation role was created in line with these instructions:

To set up the service provider, complete steps 5–7 from Federate Amazon Redshift access with Okta as an identity provider.

For step 7, the Amazon Redshift connector needs the following additional permission:

{
    "Sid": "DescribeClusters",
    "Effect": "Allow",
    "Action": "redshift:DescribeClusters",
    "Resource": "*"
}

If the role does not provide permissions to DescribeClusters, this could prevent us from being able to see the dwh cluster.
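To sanity-check a policy document locally (illustrative only; the `allows` helper below is hypothetical, not part of any AWS SDK), one could scan its statements for the required action:

```python
# Check whether a policy document (e.g. as retrieved from IAM) contains
# an Allow statement for a given action. Hypothetical helper for
# illustration; exact-match only, no wildcard expansion.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DescribeClusters",
            "Effect": "Allow",
            "Action": "redshift:DescribeClusters",
            "Resource": "*",
        }
    ],
}

def allows(policy: dict, action: str) -> bool:
    for stmt in policy["Statement"]:
        actions = stmt["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if stmt["Effect"] == "Allow" and action in actions:
            return True
    return False

print(allows(policy, "redshift:DescribeClusters"))  # True
```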

@namtonthat
Author

namtonthat commented Dec 1, 2021

Hi @Brooke-white ,
I have checked and confirmed that the role has the permission allowing Action: redshift:DescribeClusters. I've done some debugging and made some progress, but have run into another error.

On a simple query like this, it returns the ClusterNotFound error on the first run, but on subsequent runs it always returns (1,) as the output, which is fantastic.

import redshift_connector
import logging
import sys

root = logging.getLogger()
root.setLevel(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)  # will write all logs with a level >= DEBUG
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)



import sqlalchemy as sa
from sqlalchemy.engine.url import URL

url = URL.create(
    drivername='redshift+redshift_connector',  # indicate redshift_connector driver and dialect will be used
)

conn_params = {
    "ssl": True,
    "iam": True,  # must be enabled when authenticating via IAM
    "credentials_provider": "BrowserSamlCredentialsProvider",
    "database": 'vega',  # Amazon Redshift database
    "region": "ap-southeast-2",
    "cluster_identifier": "dwh",
    "login_url": "<okta_login>"
}

engine = sa.create_engine(url, connect_args=conn_params)

with engine.connect() as connection:
    result = connection.execute("select 1")
    for row in result:
        print(row)

engine.dispose()
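A transient first-run failure like the one described can be smoothed over with a small retry wrapper. A hedged sketch (not from the thread; `connect_once` here is a stand-in that fails once and then succeeds, mimicking the observed behaviour):

```python
import time

# Stand-in for the engine.connect()/execute call above: fails on the
# first attempt with a ClusterNotFound-style error, succeeds afterwards.
attempts = {"n": 0}

def connect_once():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("ClusterNotFound")
    return (1,)

def with_retry(fn, tries=3, delay=0.0):
    """Retry fn up to `tries` times, re-raising on the final failure."""
    for i in range(tries):
        try:
            return fn()
        except RuntimeError:
            if i == tries - 1:
                raise
            time.sleep(delay)

print(with_retry(connect_once))  # (1,)
```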

However, repeating it for the Amundsen query (adjusting the sample_postgres_loader.py to suit) will always cause a ClusterNotFound connection error.

Any ideas on what could be the issue?

@Brooke-white
Contributor

Hi @nam-tonthat-afterpay,

Can you confirm if the Python script which only uses redshift_connector is now consistently connecting?

On a simple query like this, it returns the ClusterNotFound error on the first run but on subsequent runs, it always returns (1,) as the output which is fantastic.

When this behavior occurs, what environment(s) are used to run the Python script (e.g. Jupyter)?

However, repeating it for the amundsen query (adjusting the sample_postgres_loader.py to suit), will always cause a ClusterNotFound connection error.

Could you share the modified sample_postgres_loader.py so I can work on reproducing this behavior? Also, how is sample_postgres_loader.py being run? Is the idea that it is manually invoked or done so as a scheduled task?

@Brooke-white
Contributor

Hi @nam-tonthat-afterpay ,

Following up to see if you're still experiencing issues here.

@namtonthat
Author

Hi @Brooke-white , yes I am - I'm just a little tied up with some tickets that I have to close off - will update this thread by tomorrow! Apologies for the wait :'(.

@namtonthat
Author

Hi @Brooke-white, I did a fresh install this morning and tried again. Looks like everything is working perfectly, so I'm closing the thread. The debug logging helped me iron out the issues. Thank you again for your help and patience on this!

@Brooke-white
Contributor

yay! I'm happy to hear this, @nam-tonthat-afterpay. Best wishes on your project :)
