Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unload_table macro #7

Merged
merged 6 commits into from
Nov 7, 2017
Merged

Add unload_table macro #7

merged 6 commits into from
Nov 7, 2017

Conversation

abelsonlive
Copy link
Contributor

@abelsonlive abelsonlive commented Nov 3, 2017

WHAT

This PR adds an unload_table macro for dumping a Redshift table to S3.

WHY

Many data science applications built on top of Redshift require passing the outputs of a query along to another application. Using unload_table in a post-hook is a nice way of making the results of a dbt run accessible to other applications which can access S3.

HOW

unload_table closely follows the implementation of UNLOAD in Redshift.

unload_table includes sane defaults such as:

  • Requiring that the table to unload (table) and the location to unload it to (s3_path) are always set.
  • Requiring that IAM_ROLE or ACCESS_KEY_ID and SECRET_ACCESS_KEY are always set.
  • [Always escaping the DELIMITER].(http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html#unload-usage-notes)
  • NULL AS set to ''.
  • DELIMITER set to ,.
  • MANIFEST turned off.
  • MAX_FILE_SIZE set to 6 GB.
  • ADDQUOTES turned off.
  • COMPRESSION turned off.
  • ENCRYPTED turned off.
  • ALLOWOVERWRITE turned off.
  • PARALLEL turned off.

ISSUES

Ideally, unload_table would be handled by a more generalized macro unload which would accept a query string rather than a table name. However, my attempts to implement this inside a model were unsuccessful. My idea was that I could do something like this:

-- select all from the output table
{% set query='SELECT * FROM ' + this %}
{{
  config({
    'materialized': 'table',
    'post-hook': [
      '{{ unload(query=query,
                 s3_path=var("s3_path"),
                 aws_key=var("aws_key"),
                 aws_secret=var("aws_secret") }}'
    ]
  })
}}
SELECT * FROM table

But, presumably because this is a special variable, it raised the following error:

Encountered an error:
must be str, not Relation

A second, more crude approach, attempted the following:

-- select all from the output table
{% set query='SELECT * FROM ' + this.schema + '.' + this.table %}
{{
  config({
    'materialized': 'table',
    'post-hook': [
      '{{ unload(query=query,
                 s3_path=var("s3_path"),
                 aws_key=var("aws_key"),
                 aws_secret=var("aws_secret") }}'
    ]
  })
}}
SELECT * FROM table

But this returned the following error:

Database Error in model unload_test.sql (models/unload_test.sql)
  select query cannot be empty/blank.

I assume there are a few different approaches to this macro that might address this issue, but for now i think this is extremely useful as-is!

@drewbanin
Copy link
Contributor

drewbanin commented Nov 3, 2017

hey @abelsonlive - This is a great macro and a great PR :)

this and ref() return special Relation objects. You can render them as strings like this:

{% set query = 'select * from ' + (this | string) %}

Or alternatively, use the concatenation operator to coerce it to a string:

{% set query = 'select * from ' ~ this %}

I think that you'll have a problem getting this to work right though. When the hook is parsed, it doesn't have access to the local scope of the model, so query variable that you're creating will be None I believe.

All that is to say: I think your final implementation is a good one!

I plan on adding docs to this repo in a similar fashion to dbt-utils. Once that's ready, I think we'll be good to write some docs for this macro then get it merged in!

@drewbanin drewbanin self-requested a review November 3, 2017 20:53
@drewbanin
Copy link
Contributor

@abelsonlive I just updated the docs here -- can you add an entry for the unload_table macro in README.md? Otherwise LGTM

@abelsonlive
Copy link
Contributor Author

abelsonlive commented Nov 7, 2017

Alright! I'm going to open another PR for the docs.

@drewbanin
Copy link
Contributor

@abelsonlive you can just make that commit in this PR if it's easier for you!

@abelsonlive
Copy link
Contributor Author

Okay! I added docs and a schema argument to replicate your implementation of compress_table.

@abelsonlive abelsonlive closed this Nov 7, 2017
@abelsonlive abelsonlive reopened this Nov 7, 2017
@abelsonlive
Copy link
Contributor Author

whups accidentally pressed Close and Comment

Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more tiny comment in the docs (typo), but then this looks good to me! Thanks so much for submitting @abelsonlive

README.md Outdated
s3_path='s3://bucket/folder',
aws_key='abcdef',
aws_secret='ghijklm',
delimter='|') }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should be delimiter :)

@drewbanin drewbanin merged commit bbfcd8c into dbt-labs:master Nov 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants