
[AIRFLOW-4184] Add an athena helper to insert into table #4996

Closed

bryanyang0528 wants to merge 2 commits into apache:master from bryanyang0528:add_an_athena_helper_inster_into_table

Conversation

@bryanyang0528 (Contributor):

Make sure you have checked all steps below.

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

Code Quality

  • Passes flake8

@bryanyang0528 force-pushed the add_an_athena_helper_inster_into_table branch from 9adfc83 to 660714f on March 29, 2019 at 00:54
@ashb (Member) left a comment:

The implementation needs some work.

Is this even needed? Couldn't this be done via an SQL statement in the Athena operator?

return self.conn.stop_query_execution(QueryExecutionId=query_execution_id)


class AWSAthenaHelpers(AWSAthenaHook):
@ashb (Member):
This class name doesn't follow any of Airflow's conventions.

This feels like an operator, not a hook class.

@bryanyang0528 (Contributor, Author) Mar 29, 2019:

Because Athena originally didn't support the INSERT INTO TABLE clause, I'm not sure what kind of class this function belongs in.

SUCCESS_STATES = ('SUCCEEDED',)

-    def __init__(self, aws_conn_id='aws_default', sleep_time=30, *args, **kwargs):
+    def __init__(self, aws_conn_id='aws_default', region_name=None, sleep_time=30, *args, **kwargs):
@ashb (Member):

This shouldn't be needed - this already comes from the connection


def run_insert_into_table(self, src_db, src_table, dst_db, dst_table, mode='error'):
"""
insert data in s3 from the source table to the destination table
@ashb (Member):

Why is this going via S3 rather than issuing an Athena query?

@bryanyang0528 (Contributor, Author) Mar 29, 2019:

Because Athena originally didn't support an INSERT INTO TABLE ... SELECT clause, we actually need to CREATE TABLE ... AS SELECT into a temp table and then move the data from the temp table's location to the S3 location of the target table.
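The workaround described here can be sketched roughly as follows. This is not the PR's actual code; all names (`build_ctas_query`, `rewrite_key`, the bucket and table names) are hypothetical, and the real boto3 calls are only indicated in comments:

```python
# Sketch of the CTAS-then-move workaround: since Athena (at the time) had no
# INSERT INTO ... SELECT, one can CREATE TABLE AS SELECT into a temp table
# whose data lands under a known S3 prefix, then copy those objects under the
# target table's prefix.

def build_ctas_query(temp_table, select_sql, temp_location):
    """Build a CREATE TABLE AS SELECT statement writing to temp_location."""
    return (
        "CREATE TABLE {tbl} "
        "WITH (external_location = '{loc}', format = 'PARQUET') "
        "AS {sel}".format(tbl=temp_table, loc=temp_location, sel=select_sql)
    )

def rewrite_key(key, temp_prefix, target_prefix):
    """Map an object key under the temp table's prefix to the target prefix."""
    if not key.startswith(temp_prefix):
        raise ValueError("key %r is not under %r" % (key, temp_prefix))
    return target_prefix + key[len(temp_prefix):]

if __name__ == "__main__":
    sql = build_ctas_query(
        "tmp_user_profile",
        "SELECT count(*) AS cnt FROM src_table",
        "s3://bucket/tmp/user_profile/",
    )
    print(sql)
    # The query would be submitted with boto3, e.g.
    #   athena = boto3.client("athena")
    #   athena.start_query_execution(QueryString=sql, ...)
    # and each result object copied via s3.copy_object(...) to the key
    # returned by rewrite_key(), then the temp table dropped.
```

The pure string-building parts are shown because that is where the workaround's logic lives; the S3 copy loop and cleanup are straightforward boto3 calls.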

@RosterIn (Contributor):

@bryanyang0528
Can you please explain the use case for this?

When I want to add data to Athena I just upload files to the same directory and it works like a charm, as Athena scans the whole folder.
For cases where I upload files to another path I just need to update the table partition with:
ALTER TABLE ____ ADD IF NOT EXISTS PARTITION (dt=____) LOCATION 's3://____'
https://stackoverflow.com/questions/50164744/how-to-efficiently-append-new-data-to-table-in-aws-athena
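The partition-registration approach described above can be sketched as below. The helper and all names are illustrative, not from the PR; only the SQL shape follows the comment:

```python
# Sketch: after uploading files under a new dt= prefix, register the
# partition so Athena will scan it. The DDL is ordinary Athena SQL and could
# be submitted through the existing AWSAthenaHook like any other query.

def build_add_partition_sql(table, dt, location):
    """Build an ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement."""
    return (
        "ALTER TABLE {tbl} ADD IF NOT EXISTS "
        "PARTITION (dt='{dt}') LOCATION '{loc}'".format(
            tbl=table, dt=dt, loc=location)
    )

if __name__ == "__main__":
    print(build_add_partition_sql(
        "user_profile",
        "2019-03-29",
        "s3://bucket/user_profile/dt=2019-03-29/",
    ))
    # Submitted via boto3 like any DDL:
    #   athena = boto3.client("athena")
    #   athena.start_query_execution(QueryString=sql, ...)
```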

@bryanyang0528 (Contributor, Author) Apr 30, 2019:

@RosterIn Thank you for your suggestion. I know ALTER TABLE can assign an S3 location to a partition, but I would like to move the data into the same S3 location after creating a new partition.

For example, I have a table called user_profile in which I aggregate user behavior data from my website. I have to select data from the source table daily and insert the result into user_profile. In this case, in Hive, I would just use INSERT INTO TABLE user_profile PARTITION (dt) SELECT count(*) FROM src_table, and the data would be put into the same location as the target table.

Maybe I could combine ALTER TABLE and an S3 copy to implement this idea. Does that make sense?

@RosterIn (Contributor):

@bryanyang0528 better to wait for @ashb to respond.

@bryanyang0528 (Contributor, Author):

@RosterIn Thx.

@ashb (Member):

If (and it is still an if) we add this to Airflow, this function belongs in an Operator, not a Hook.

Additionally you are putting S3 files in place but doing nothing to update the Glue catalog.

Overall I am skeptical that this approach is the right way of doing it. In the past I have done this sort of thing from an EMR cluster using the Glue-compatible hive metastore to insert into tables.
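The point about the Glue catalog can be made concrete: copying objects under a new partition prefix does not, by itself, make Athena see them; the partition must also be registered in the catalog. A minimal sketch, with hypothetical names and the actual boto3 `glue.create_partition` call only indicated in a comment:

```python
# Build the PartitionInput for glue.create_partition after placing files
# under a new dt= prefix. We reuse the table's StorageDescriptor (as returned
# by glue.get_table) and point its Location at the partition prefix.

def build_partition_input(dt, table_location, storage_descriptor):
    """Return a PartitionInput dict for a new dt= partition."""
    sd = dict(storage_descriptor)  # shallow copy; do not mutate the original
    sd["Location"] = "{}/dt={}/".format(table_location.rstrip("/"), dt)
    return {"Values": [dt], "StorageDescriptor": sd}

if __name__ == "__main__":
    inp = build_partition_input(
        "2019-03-29",
        "s3://bucket/warehouse/user_profile/",
        {"Columns": [], "SerdeInfo": {}},
    )
    print(inp["StorageDescriptor"]["Location"])
    # Registered with:
    #   glue = boto3.client("glue")
    #   glue.create_partition(DatabaseName=db, TableName=tbl,
    #                         PartitionInput=inp)
```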

@bryanyang0528 (Contributor, Author):

Thank you for the suggestion. I will close this issue and create a new one. I know EMR could do this, but it takes a long time to create an EMR cluster and run the computation on it.
