
[AIRFLOW-4184] Add an athena helper to insert into table #4996

Closed

bryanyang0528 wants to merge 2 commits into apache:master from bryanyang0528:add_an_athena_helper_inster_into_table

Conversation

@bryanyang0528 (Contributor):

Make sure you have checked all steps below.

Jira

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.
    • All the public functions and classes in the PR contain docstrings that explain what they do
    • If you implement backwards incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

Code Quality

  • Passes flake8

@bryanyang0528 force-pushed the add_an_athena_helper_inster_into_table branch from 9adfc83 to 660714f on March 29, 2019 at 00:54
@ashb (Member) left a comment:

The implementation needs some work.

Is this even needed? Couldn't this be done via an SQL statement in the Athena operator?

return self.conn.stop_query_execution(QueryExecutionId=query_execution_id)


class AWSAthenaHelpers(AWSAthenaHook):
@ashb (Member):
This class name doesn't follow any of Airflow's conventions.

This feels like an operator, not a hook class.

@bryanyang0528 (Contributor, Author) Mar 29, 2019:

Because Athena originally didn't support the INSERT INTO TABLE clause, I'm not sure what kind of class this function belongs in.

SUCCESS_STATES = ('SUCCEEDED',)

-    def __init__(self, aws_conn_id='aws_default', sleep_time=30, *args, **kwargs):
+    def __init__(self, aws_conn_id='aws_default', region_name=None, sleep_time=30, *args, **kwargs):
@ashb (Member):

This shouldn't be needed - this already comes from the connection


def run_insert_into_table(self, src_db, src_table, dst_db, dst_table, mode='error'):
"""
insert data in s3 from the source table to the destination table
@ashb (Member):

Why is this going via S3 rather than issuing an Athena query?

@bryanyang0528 (Contributor, Author) Mar 29, 2019:

Because Athena originally didn't support an INSERT INTO TABLE ... SELECT clause, we actually need to CREATE TABLE ... AS SELECT into a temp table and then move the data from the temp table's location to the S3 location of the target table.
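The workaround described here can be sketched roughly as follows. This is not the PR's actual code; all names (`build_ctas_query`, `rewrite_key`, the bucket and table names) are hypothetical, and the real boto3 calls are only indicated in comments:

```python
# Sketch of the CTAS-then-move workaround: since Athena (at the time) had no
# INSERT INTO ... SELECT, one can CREATE TABLE AS SELECT into a temp table
# whose data lands under a known S3 prefix, then copy those objects under the
# target table's prefix.

def build_ctas_query(temp_table, select_sql, temp_location):
    """Build a CREATE TABLE AS SELECT statement writing to temp_location."""
    return (
        "CREATE TABLE {tbl} "
        "WITH (external_location = '{loc}', format = 'PARQUET') "
        "AS {sel}".format(tbl=temp_table, loc=temp_location, sel=select_sql)
    )

def rewrite_key(key, temp_prefix, target_prefix):
    """Map an object key under the temp table's prefix to the target prefix."""
    if not key.startswith(temp_prefix):
        raise ValueError("key %r is not under %r" % (key, temp_prefix))
    return target_prefix + key[len(temp_prefix):]

if __name__ == "__main__":
    sql = build_ctas_query(
        "tmp_user_profile",
        "SELECT count(*) AS cnt FROM src_table",
        "s3://bucket/tmp/user_profile/",
    )
    print(sql)
    # The query would be submitted with boto3, e.g.
    #   athena = boto3.client("athena")
    #   athena.start_query_execution(QueryString=sql, ...)
    # and each result object copied via s3.copy_object(...) to the key
    # returned by rewrite_key(), then the temp table dropped.
```

The pure string-building parts are shown because that is where the workaround's logic lives; the S3 copy loop and cleanup are straightforward boto3 calls.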

@RosterIn (Contributor):

@bryanyang0528
Can you please explain the use case for this?

When I want to add data to Athena I just upload files to the same directory and it works like a charm, as Athena scans the whole folder.
For cases where I upload files to another path I just need to update the table partition with:
ALTER TABLE ____ ADD IF NOT EXISTS PARTITION (dt=____) LOCATION 's3://____'
https://stackoverflow.com/questions/50164744/how-to-efficiently-append-new-data-to-table-in-aws-athena
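The partition-registration approach described above can be sketched as below. The helper and all names are illustrative, not from the PR; only the SQL shape follows the comment:

```python
# Sketch: after uploading files under a new dt= prefix, register the
# partition so Athena will scan it. The DDL is ordinary Athena SQL and could
# be submitted through the existing AWSAthenaHook like any other query.

def build_add_partition_sql(table, dt, location):
    """Build an ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement."""
    return (
        "ALTER TABLE {tbl} ADD IF NOT EXISTS "
        "PARTITION (dt='{dt}') LOCATION '{loc}'".format(
            tbl=table, dt=dt, loc=location)
    )

if __name__ == "__main__":
    print(build_add_partition_sql(
        "user_profile",
        "2019-03-29",
        "s3://bucket/user_profile/dt=2019-03-29/",
    ))
    # Submitted via boto3 like any DDL:
    #   athena = boto3.client("athena")
    #   athena.start_query_execution(QueryString=sql, ...)
```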

@bryanyang0528 (Contributor, Author) Apr 30, 2019:

@RosterIn Thank you for your suggestion. I know ALTER TABLE can assign an S3 location to a partition, but I would like to move the data into the same S3 location after creating a new partition.

For example, I have a table called user_profile in which I aggregate user behavior data from my website. I have to select data from the source table daily and insert the result into user_profile. In this case, in Hive, I would just use INSERT INTO TABLE user_profile PARTITION (dt) SELECT count(*) FROM src_table, and the data would be put into the same location as the target table.

Maybe I could combine ALTER TABLE and an S3 copy to implement this idea. Does that make sense?

@RosterIn (Contributor):

@bryanyang0528 better to wait for @ashb to respond.

@bryanyang0528 (Contributor, Author):

@RosterIn Thx.

@ashb (Member):

If (and it is still an if) we add this to Airflow, this function belongs in an Operator, not a Hook.

Additionally you are putting S3 files in place but doing nothing to update the Glue catalog.

Overall I am skeptical that this approach is the right way of doing it. In the past I have done this sort of thing from an EMR cluster using the Glue-compatible hive metastore to insert into tables.
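The point about the Glue catalog can be made concrete: copying objects under a new partition prefix does not, by itself, make Athena see them; the partition must also be registered in the catalog. A minimal sketch, with hypothetical names and the actual boto3 `glue.create_partition` call only indicated in a comment:

```python
# Build the PartitionInput for glue.create_partition after placing files
# under a new dt= prefix. We reuse the table's StorageDescriptor (as returned
# by glue.get_table) and point its Location at the partition prefix.

def build_partition_input(dt, table_location, storage_descriptor):
    """Return a PartitionInput dict for a new dt= partition."""
    sd = dict(storage_descriptor)  # shallow copy; do not mutate the original
    sd["Location"] = "{}/dt={}/".format(table_location.rstrip("/"), dt)
    return {"Values": [dt], "StorageDescriptor": sd}

if __name__ == "__main__":
    inp = build_partition_input(
        "2019-03-29",
        "s3://bucket/warehouse/user_profile/",
        {"Columns": [], "SerdeInfo": {}},
    )
    print(inp["StorageDescriptor"]["Location"])
    # Registered with:
    #   glue = boto3.client("glue")
    #   glue.create_partition(DatabaseName=db, TableName=tbl,
    #                         PartitionInput=inp)
```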

@bryanyang0528 (Contributor, Author):

Thank you for the suggestion. I will close this issue and create a new one. I know EMR could do this, but it takes a long time to create an EMR cluster and run the computation on it.
