
Add azure blob storage filesystem/staging destination #592

Merged
rudolfix merged 16 commits into devel from sthor/azure-blob on Sep 1, 2023

Conversation

steinitzu
Collaborator

Resolves #560

@netlify

netlify bot commented Aug 26, 2023

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 24e4c2e
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/64f0db279ebce400087cf4d6

@steinitzu steinitzu marked this pull request as ready for review August 27, 2023 15:49
@rudolfix rudolfix (Collaborator) left a comment

just a few comments. please check

dlt/common/configuration/specs/azure_credentials.py (outdated, resolved)

def to_adlfs_credentials(self) -> Dict[str, Any]:
    """Return a dict that can be passed as kwargs to adlfs"""
    return dict(
        ...  # excerpt truncated in the diff view
    )
@rudolfix
Collaborator

adlfs does not need this token. (maybe it does that itself?)

import fsspec

fs = fsspec.filesystem("adl", account_name=adls_account_name, account_key=adls_account_key)
print(fs.ls("dlt-ci-test-bucket/"))

works for me

@steinitzu
Collaborator Author

So the SAS token is needed for Snowflake when not using a named stage: https://github.com/dlt-hub/dlt/pull/592/files#diff-577441894b0e65b848310ea282c52bab00b6f649eb2c1b9402a5423d70e4d3e4R84
Otherwise we don't need it.
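
For context, a minimal sketch of producing such a SAS token with the azure-storage-blob SDK (account name, key, permissions, and expiry here are illustrative, not the PR's actual code):

import datetime
from azure.storage.blob import AccountSasPermissions, ResourceTypes, generate_account_sas

# Illustrative values; dlt reads these from its credentials configuration.
sas_token = generate_account_sas(
    account_name="my_account",
    account_key="my_account_key",
    resource_types=ResourceTypes(container=True, object=True),
    permission=AccountSasPermissions(read=True, write=True, list=True),
    expiry=datetime.datetime.utcnow() + datetime.timedelta(hours=1),
)
# Snowflake can then read the staged files directly, e.g.:
# COPY INTO ... FROM 'azure://<account>.blob.core.windows.net/<container>/<path>'
#   CREDENTIALS = (AZURE_SAS_TOKEN = '<sas_token>')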

dlt/destinations/filesystem/configuration.py (resolved)
dlt/destinations/filesystem/filesystem_client.py (outdated, resolved)
pyproject.toml (outdated, resolved)
@rudolfix
Collaborator

@sh-rp please take a look:

  • whether the test setup is OK
  • please add the required secrets to CI (check 1Password)
  • please set up the required Snowflake stage
  • update documentation if needed

thanks!

@rudolfix
Collaborator

@mjducutoae so blob storage will be available in alpha, most probably on Tuesday. You can take a look at the implementation here; I think it is a fairly standard way of using adlfs.
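
For readers following along, a hedged usage sketch of what this destination enables once released (the env-var keys follow dlt's standard config convention; the exact azure credential field names are my assumption based on this PR):

import os
import dlt

# Illustrative configuration; in practice this belongs in secrets.toml.
os.environ["DESTINATION__FILESYSTEM__BUCKET_URL"] = "az://dlt-ci-test-bucket"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME"] = "my_account"
os.environ["DESTINATION__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY"] = "my_account_key"

pipeline = dlt.pipeline(pipeline_name="az_demo", destination="filesystem", dataset_name="demo")
pipeline.run([{"id": 1, "name": "example"}], table_name="items")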

@sh-rp
Collaborator

sh-rp commented Aug 30, 2023

@steinitzu: when providing the AZ credentials I am getting

dlt.destinations.exceptions.DatabaseTerminalException: 091003 (22000): 01aea694-3201-ee94-0002-0a0a026d6e8e: Failure using stage area. Cause: [The specifed resource name contains invalid characters. (Status Code: 400; Error Code: InvalidResourceName)]

when running "pytest tests/load/pipeline/test_stage_loading.py::test_all_data_types" with the "az-authorization" destination config. Can you see this locally too? I can't quite figure out which resource name is invalid.

@sh-rp
Collaborator

sh-rp commented Aug 30, 2023

I have set up the az integration on Snowflake and this seems to work; we only need to figure out this error for direct authentication.

@steinitzu
Collaborator Author

@sh-rp my bad. I was stripping out the bucket name when generating the Snowflake-compatible URL. Fixed in the last commit; it runs for me locally now.
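
A hypothetical sketch of the kind of fix described (the function name and exact logic are illustrative, not the PR's actual code): the container name must be kept when converting the fsspec-style az:// URL into the azure:// form Snowflake expects.

from urllib.parse import urlparse

def make_snowflake_azure_url(bucket_url: str, account_name: str) -> str:
    # bucket_url like "az://dlt-ci-test-bucket/some/prefix"
    parsed = urlparse(bucket_url)
    # Keep the container (netloc) instead of stripping it out.
    return f"azure://{account_name}.blob.core.windows.net/{parsed.netloc}{parsed.path}"

print(make_snowflake_azure_url("az://dlt-ci-test-bucket/prefix", "myaccount"))
# -> azure://myaccount.blob.core.windows.net/dlt-ci-test-bucket/prefix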

@sh-rp
Collaborator

sh-rp commented Aug 30, 2023

@steinitzu perfect, this works now. There is one failing test remaining, which is:

pytest tests/load/pipeline/test_filesystem_pipeline.py::test_pipeline_merge_write_disposition

I am getting

('Failed to remove %s for %s', ['dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/_dlt_pipeline_state', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/_dlt_pipeline_state/', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/_dlt_pipeline_state/1693412282.432455.178b58ce42.jsonl', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/_dlt_pipeline_state/1693412284.13063.fdec11df2b.jsonl', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/other_data', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/other_data/', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/other_data/1693412284.13063.a5a78fdcf1.jsonl', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/some_data', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/some_data/', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/some_data/1693412284.13063.f72071cafb.jsonl', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/some_source._dlt_loads.1693412282.432455', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/some_source._dlt_loads.1693412283.199035', 'dlt-ci-test-bucket/test_23db9bc8501f4dc572e0c89497831423/some_source._dlt_loads.1693412284.13063'], ResourceExistsError('This operation is not permitted on a non-empty directory.\nRequestId:d7e03825-801e-002d-305d-db4ce4000000\nTime:2023-08-30T16:18:06.3619677Z\nErrorCode:DirectoryIsNotEmpty'))

On the console for that run, so maybe you're deleting a non-empty directory, which fails, or something like that.

@sh-rp
Collaborator

sh-rp commented Aug 30, 2023

One additional thing: Azure is super verbose on the console even when the tests do not fail. Maybe there is a way to limit that a bit?

@steinitzu
Collaborator Author

> @steinitzu perfect, this works now. There is one failing test remaining, which is:
> pytest tests/load/pipeline/test_filesystem_pipeline.py::test_pipeline_merge_write_disposition
> […full error output quoted above…]

I couldn't replicate this specific error, but I was getting a failure in this test because of the listings cache, so I disabled it for all fs clients in f47473d.
Hopefully everything works now 🙏
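
For reference, a minimal sketch of what disabling the listings cache looks like at the fsspec level (use_listings_cache and skip_instance_cache are real fsspec options; how dlt wires them in is simplified here):

import fsspec

fs = fsspec.filesystem(
    "az",
    account_name="my_account",     # illustrative credentials
    account_key="my_account_key",
    use_listings_cache=False,  # don't cache ls() results, so deletes are visible immediately
    skip_instance_cache=True,  # don't reuse a previously created, cached filesystem instance
)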

@steinitzu
Collaborator Author

> One additional thing: Azure is super verbose on the console even when the tests do not fail. Maybe there is a way to limit that a bit?

The root logger is always set to INFO; I can't figure out why. It happens for me regardless of what's set in the dlt config.
Not sure if it's dlt or some other package setting it?

This particular log can be disabled with:

import logging

log = logging.getLogger("azure.core.pipeline.policies.http_logging_policy")
log.setLevel(logging.WARNING)

But IMO this is best solved by not overriding the root log level; then this verbosity is opt-in.

@rudolfix
Collaborator

@steinitzu hmmmm we do not touch the root logger. dlt was logging to root, but that was removed several months ago.
My take is that pytest is doing something. We have the following setting:

[pytest]
python_paths= dlt
norecursedirs= .direnv .eggs build dist
addopts= -v --showlocals --durations 10
xfail_strict= true
log_cli= 1
log_cli_level= INFO
python_files = test_*.py *_test.py

log_cli_level is probably messing with the root logger.

During the tests several loggers are throttled in tests/conftest.py, e.g.:

    # disable sqlfluff logging
    for log in ["sqlfluff.parser", "sqlfluff.linter", "sqlfluff.templater", "sqlfluff.lexer"]:
        logging.getLogger(log).setLevel("ERROR")

    # disable snowflake logging
    for log in ["snowflake.connector.cursor", "snowflake.connector.connection"]:
        logging.getLogger(log).setLevel("ERROR")
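
Following the same pattern, the Azure verbosity mentioned above could presumably be throttled in tests/conftest.py too (a sketch using the logger name from the earlier comment; whether this was actually added is not shown here):

    # disable azure http request/response logging
    for log in ["azure.core.pipeline.policies.http_logging_policy"]:
        logging.getLogger(log).setLevel("ERROR")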

@steinitzu
Collaborator Author

Hey @rudolfix,
I'm getting this outside pytest too. Something happens when calling dlt.pipeline(); afterwards the root log level is changed:

import dlt
import logging

log = logging.getLogger()

print(log)

dlt.pipeline('my_pipe')

print(log)

Output:

<RootLogger root (WARNING)>
<RootLogger root (INFO)>

@steinitzu
Collaborator Author

It's coming from Airflow, just from it being installed. export AIRFLOW__LOGGING__LOGGING_LEVEL=WARNING overrides it.
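
A small sketch of that workaround in code (setting the variable before dlt, and hence Airflow's logging config, is imported; this is my reading of the fix, not code from the PR):

import logging
import os

# Must be set before Airflow's logging config loads; exporting it in the
# shell before starting Python works the same way.
os.environ["AIRFLOW__LOGGING__LOGGING_LEVEL"] = "WARNING"

import dlt

print(logging.getLogger())  # <RootLogger root (WARNING)>
dlt.pipeline("my_pipe")
print(logging.getLogger())  # stays at WARNING instead of being bumped to INFO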

@rudolfix rudolfix merged commit c945650 into devel Sep 1, 2023
36 checks passed
@rudolfix rudolfix deleted the sthor/azure-blob branch September 1, 2023 06:31