[bug] v1.3.4: Invalid bucket name "" #113

Closed
rumbin opened this issue Jan 17, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@rumbin

rumbin commented Jan 17, 2023

Description

On dbt-athena-community==1.3.4, when running something like dbt run -s +my_model, the run fails with error messages of this kind:

11:57:16  Parameter validation failed:
11:57:16  Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"

So it seems that the bucket location is set to an empty string.
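The message itself supports that reading: the first pattern quoted in the error requires at least one character, so an empty string can never pass. Below is a quick check in Python using the plain bucket-name regex copied from the error output. The helper name is_valid_bucket_name is made up for illustration, and it ignores the ARN alternatives.

```python
import re

# Plain bucket-name pattern, copied verbatim from the botocore error message.
BUCKET_NAME_RE = re.compile(r"^[a-zA-Z0-9.\-_]{1,255}$")


def is_valid_bucket_name(name: str) -> bool:
    """Hypothetical helper: True if `name` matches the plain bucket-name pattern."""
    return BUCKET_NAME_RE.match(name) is not None


# Bucket taken from s3_data_dir in the profile below.
print(is_valid_bucket_name("aws-athena-tables-data"))
# The value botocore actually received.
print(is_valid_bucket_name(""))
```

The first call returns True and the second returns False, because the {1,255} quantifier rules out a zero-length name.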

When switching to dbt-athena-community==1.3.3 and keeping everything else as is, the same command succeeds without any errors.

This issue can be observed on both Athena v2 and v3.

This is my profiles.yml:

athena:
  outputs:
    dev:
      type: athena
      s3_staging_dir: s3://aws-athena-query-results-eu-central-1-9999999999999/dbt/
      s3_data_dir: s3://aws-athena-tables-data/dbt/bumblebee/dev
      s3_data_naming: schema_table_unique
      region_name: eu-central-1
      database: awsdatacatalog
      schema: dbt_dev_philipp
      work_group: whatever
      num_retries: 1
      threads: 8
@nicor88
Member

nicor88 commented Jan 18, 2023

@rumbin could you share the file name and line that raise this exception? They should be visible in the traceback raised from the adapter.

A change that may be causing this: https://github.com/dbt-athena/dbt-athena/pull/88/files#diff-629c67ee6aeee24555537b786b7560cbf4d17496c1388aa932de34081c76f668R178-R188

@nicor88 nicor88 added the bug Something isn't working label Jan 18, 2023
@rumbin
Author

rumbin commented Jan 18, 2023

@nicor88

16:28:46.345016 [debug] [Thread-1  ]: Began running node model.bumblebee.health_errors
16:28:46.345295 [info ] [Thread-1  ]: 1 of 1 START sql table model dbt_dev_philipp.health_errors ..................... [RUN]
16:28:46.345760 [debug] [Thread-1  ]: Acquiring new athena connection "model.bumblebee.health_errors"
16:28:46.346020 [debug] [Thread-1  ]: Began compiling node model.bumblebee.health_errors
16:28:46.346193 [debug] [Thread-1  ]: Compiling model.bumblebee.health_errors
16:28:46.350230 [debug] [Thread-1  ]: Writing injected SQL for node "model.bumblebee.health_errors"
16:28:46.350664 [debug] [Thread-1  ]: finished collecting timing info
16:28:46.350833 [debug] [Thread-1  ]: Began executing node model.bumblebee.health_errors
16:28:46.361999 [debug] [Thread-1  ]: Opening a new connection, currently in state closed
16:28:46.603291 [debug] [Thread-1  ]: finished collecting timing info
16:28:46.603554 [debug] [Thread-1  ]: On model.bumblebee.health_errors: Close
16:28:46.604141 [error] [Thread-1  ]: Unhandled error while executing model.bumblebee.health_errors
Parameter validation failed:
Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
16:28:46.604332 [debug] [Thread-1  ]: 
Traceback (most recent call last):
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/task/base.py", line 385, in safe_run
    result = self.compile_and_execute(manifest, ctx)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/task/base.py", line 338, in compile_and_execute
    result = self.run(ctx.node, manifest)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/task/base.py", line 429, in run
    return self.execute(compiled_node, manifest)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/task/run.py", line 281, in execute
    result = MacroGenerator(
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/clients/jinja.py", line 326, in __call__
    return self.call_macro(*args, **kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/clients/jinja.py", line 253, in call_macro
    return macro(*args, **kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/jinja2/runtime.py", line 763, in __call__
    return self._invoke(arguments, autoescape)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/jinja2/runtime.py", line 777, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 55, in macro
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/jinja2/sandbox.py", line 393, in call
    return __context.call(__obj, *args, **kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/jinja2/runtime.py", line 298, in call
    return __obj(*args, **kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/clients/jinja.py", line 326, in __call__
    return self.call_macro(*args, **kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/clients/jinja.py", line 253, in call_macro
    return macro(*args, **kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/jinja2/runtime.py", line 763, in __call__
    return self._invoke(arguments, autoescape)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/jinja2/runtime.py", line 777, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 22, in macro
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/jinja2/sandbox.py", line 393, in call
    return __context.call(__obj, *args, **kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/jinja2/runtime.py", line 298, in call
    return __obj(*args, **kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/adapters/athena/impl.py", line 131, in clean_up_table
    self._delete_from_s3(client, s3_location)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/adapters/athena/impl.py", line 156, in _delete_from_s3
    if self._s3_path_exists(client, bucket_name, prefix):
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/dbt/adapters/athena/impl.py", line 193, in _s3_path_exists
    response = client.session.client(
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/botocore/client.py", line 919, in _make_api_call
    request_dict = self._convert_to_request_dict(
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/botocore/client.py", line 987, in _convert_to_request_dict
    api_params = self._emit_api_params(
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/botocore/client.py", line 1026, in _emit_api_params
    self.meta.events.emit(
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/Users/philippleufke/.pyenv/versions/3.9.12/lib/python3.9/site-packages/botocore/handlers.py", line 285, in validate_bucket_name
    raise ParamValidationError(report=error_msg)
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
16:28:46.610316 [debug] [Thread-1  ]: Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '4abb105a-6d2c-454d-986f-1a01c4621093', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x11d192eb0>]}
16:28:46.610728 [error] [Thread-1  ]: 1 of 1 ERROR creating sql table model dbt_dev_philipp.health_errors ............ [ERROR in 0.26s]
16:28:46.611110 [debug] [Thread-1  ]: Finished running node model.bumblebee.health_errors
16:28:46.612337 [debug] [MainThread]: Acquiring new athena connection "master"
16:28:46.612839 [info ] [MainThread]: 
16:28:46.612998 [info ] [MainThread]: Finished running 1 table model in 0 hours 0 minutes and 2.48 seconds (2.48s).
16:28:46.613157 [debug] [MainThread]: Connection 'master' was properly closed.
16:28:46.613311 [debug] [MainThread]: Connection 'model.bumblebee.health_errors' was properly closed.
16:28:46.706722 [info ] [MainThread]: 
16:28:46.706891 [info ] [MainThread]: Completed with 1 error and 0 warnings:
16:28:46.707026 [info ] [MainThread]: 
16:28:46.707158 [error] [MainThread]: Parameter validation failed:
16:28:46.707277 [error] [MainThread]: Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"

@nicor88
Member

nicor88 commented Jan 19, 2023

@rumbin the issue is inside _delete_from_s3, a function that was added in 1.3.4. It calls _parse_s3_path and passes the result to _s3_path_exists. Judging by your full error trace, the bucket name that comes back is empty.
This means that the location property for your model is somehow not configured correctly.
I'm currently using 1.3.4 in production without issues.

I suspect a misconfiguration in your model; your profiles.yml looks right.

Could you share the config that you use in your model?
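
To illustrate how an empty location turns into an empty bucket name, here is a minimal sketch of splitting an S3 URI into bucket and prefix. It is not the adapter's actual _parse_s3_path; split_s3_uri is a hypothetical stand-in built on urllib.parse from the standard library.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Hypothetical stand-in (not the adapter's _parse_s3_path).

    Splits an s3:// URI into (bucket, prefix).
    """
    parsed = urlparse(uri)
    bucket = parsed.netloc            # the part between "s3://" and the next "/"
    prefix = parsed.path.lstrip("/")  # everything after the bucket
    return bucket, prefix


# The s3_data_dir from the profile yields a proper bucket name.
print(split_s3_uri("s3://aws-athena-tables-data/dbt/bumblebee/dev"))
# An empty location yields an empty bucket, which botocore then rejects.
print(split_s3_uri(""))
```

Under these assumptions, a configured s3_data_dir parses cleanly, so an empty bucket points at the location string being empty before it is parsed.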

@rumbin
Author

rumbin commented Jan 20, 2023

@nicor88

The model config is very standard:

{{
    config(
        materialized='table'
        , tags=['daily']
    )
}}

The folder/schema config in dbt_project.yml is nothing special either:

models:
  +incremental_strategy: "insert_overwrite"
  bumblebee:
    level_1:
      schema: data_warehouse_l1
      +tags: l1
    level_2:
      schema: data_warehouse_l2
      +tags: l2
      +meta:
        # BI integration settings (Superset): override in model YAMLs, if needed
        model_maturity: high
        certification:
          certified_by: Business Intelligence Team
          details: dbt-managed Level 2 (L2) model
        owners:
          # User IDs of Superset's internal database:
          - 5 # Philipp

The only thing that comes to mind that might interfere is this override macro:

{% macro generate_schema_name(schema_name, node) -%}

    {%- set default_schema = target.schema -%}
    {%- if target.name == 'prod' and schema_name is not none -%}

        {{ schema_name | trim }}

    {%- elif var('ci_schema', 'dummy') != 'dummy' -%}

        {{ var('ci_schema') | trim }}

    {%- else -%}

        {{ default_schema }}

    {%- endif -%}

{%- endmacro %}
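
For readers less familiar with Jinja, the macro's branching can be restated in Python. This is only a sketch of the logic above: the function resolve_schema and its parameter names are hypothetical, and ci_schema stands in for var('ci_schema', 'dummy').

```python
from typing import Optional


def resolve_schema(
    target_name: str,
    target_schema: str,
    custom_schema: Optional[str],
    ci_schema: str = "dummy",
) -> str:
    """Hypothetical Python restatement of the generate_schema_name override."""
    # In prod, a custom schema from dbt_project.yml wins.
    if target_name == "prod" and custom_schema is not None:
        return custom_schema.strip()
    # Otherwise a ci_schema variable, if set, overrides everything.
    if ci_schema != "dummy":
        return ci_schema.strip()
    # Fall back to the schema from the active profile target.
    return target_schema


# On the dev target, the folder-level schema is ignored.
print(resolve_schema("dev", "dbt_dev_philipp", "data_warehouse_l1"))
```

With the dev target from the profile, this returns dbt_dev_philipp, so every model lands in the developer schema regardless of the folder-level schema setting.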

@nicor88
Member

nicor88 commented Jan 20, 2023

@rumbin I still haven't spotted the bug. I have a profile similar to yours (which is very standard).

As you are not specifying an external location per model, s3_data_dir will be used...

I ran the _parse_s3_path function on the path that you use as s3_data_dir, and it returned a correct bucket name...

@nihakue

nihakue commented Jan 23, 2023

I was facing this issue as well when changing the materialization of a model from 'view' to 'table'. You may also be doing the same thing.

The issue is in clean_up_table in impl.py. On line 124 it calls table = glue_client.get_table, and later runs if table is not None: s3_location = table["Table"]["StorageDescriptor"]["Location"] followed by self._delete_from_s3(client, s3_location).

The problem is that get_table also returns a response when the relation is a view, but with an empty 'Location'.

See for example:

(Pdb) p table["Table"]["StorageDescriptor"]
{'Columns': [{'Name': 'impressionid', 'Type': 'string'}, {'Name': 'servertimestamp', 'Type': 'timestamp'}, {'Name': 'devicetype', 'Type': 'string'}, {'Name': 'uid', 'Type': 'string'}, {'Name': 'metadata', 'Type': 'string'}, {'Name': 'os', 'Type': 'string'}, {'Name': 'persistedat', 'Type': 'timestamp'}, {'Name': 'browser', 'Type': 'string'}, {'Name': 'name', 'Type': 'string'}, {'Name': 'pageurl', 'Type': 'string'}, {'Name': 'useragent', 'Type': 'string'}, {'Name': 'id', 'Type': 'string'}, {'Name': 'snapshot_timestamp', 'Type': 'string'}], 'Location': '', 'Compressed': False, 'NumberOfBuckets': 0, 'SerdeInfo': {}, 'SortColumns': [], 'StoredAsSubDirectories': False}
(Pdb) p table["Table"]["StorageDescriptor"]["Location"]
''

Should be easy enough to delete the view manually and then run again, but it would be good if clean_up_table was view aware!
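
One way clean_up_table could become view aware is to decide from the Glue get_table response whether there is anything to delete before touching S3. The sketch below works on the response dictionary only. The helper name s3_location_to_clean is hypothetical, the TableType == "VIRTUAL_VIEW" check assumes Glue reports Athena views that way, the bucket in the example is a placeholder, and none of this is necessarily how the adapter handles it.

```python
from typing import Any, Dict, Optional


def s3_location_to_clean(table_response: Optional[Dict[str, Any]]) -> Optional[str]:
    """Hypothetical helper: return the S3 location to delete, or None to skip."""
    if table_response is None:
        return None
    table = table_response.get("Table", {})
    # Assumption: Glue marks Athena views with TableType "VIRTUAL_VIEW".
    if table.get("TableType") == "VIRTUAL_VIEW":
        return None
    location = table.get("StorageDescriptor", {}).get("Location", "")
    # An empty Location (as in the pdb output above) means there is no data to remove.
    return location or None


view = {"Table": {"TableType": "VIRTUAL_VIEW", "StorageDescriptor": {"Location": ""}}}
table = {
    "Table": {
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {"Location": "s3://example-bucket/some/prefix/"},
    }
}

print(s3_location_to_clean(view))   # nothing to delete for a view
print(s3_location_to_clean(table))  # location of the table data
```

Checking both the table type and the empty location is deliberately redundant, so the guard still holds if only one of the two signals is present in the response.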

@nicor88
Member

nicor88 commented Jan 24, 2023

@nihakue Nice hint. We could easily add an exception for views, since we recently added a get_relation_type method that lets us look up the relation type and act accordingly.

@rumbin
Author

rumbin commented Jan 24, 2023

Wow, @nihakue, this explanation fits perfectly.
The model where we observe this flaw already has a view of the same name existing in the target schema.

In fact, we use views to populate our schemas used for development runs, so we don't need to run everything upstream for each developer environment. This is more handy than using dbt --defer --state.

@nicor88 nicor88 mentioned this issue Feb 3, 2023
3 tasks
@nicor88
Member

nicor88 commented Feb 9, 2023

@rumbin this should be fixed in v1.4.0. Give it a shot, and if it still doesn't work, let us know.

@rumbin
Author

rumbin commented Feb 10, 2023

Looks like this issue is fixed now.
Thanks a lot @nicor88 for all your efforts!

@rumbin rumbin closed this as completed Feb 10, 2023
3 participants