Skip to content

Conversation

@szczeles
Copy link
Contributor

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

When s3_delete_objects_operator is used in a dynamic way (for example list of keys comes from s3_list_operator via XCom) there might be a case when the list of keys is empty. In my case it happens when chained operators are removing old files from S3 and there are no old files yet (because this is very first run of DAG).

In case of empty keys hook raises an exception (via boto3):

[2019-06-18 13:23:53,790] {{base_task_runner.py:101}} INFO - Job 115: Subtask delete_old_files [2019-06-18 13:23:53,790] {{cli.py:517}} INFO - Running <TaskInstance: xxxxxx.delete_old_files 2019-06-17T00:00:00+00:00 [running]> on host 82f571f444f5
[2019-06-18 13:23:56,199] {{__init__.py:1580}} ERROR - An error occurred (MalformedXML) when calling the DeleteObjects operation: The XML you provided was not well-formed or did not validate against our published schema
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models/__init__.py", line 1441, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.6/site-packages/airflow/contrib/operators/s3_delete_objects_operator.py", line 80, in execute
    response = s3_hook.delete_objects(bucket=self.bucket, keys=self.keys)
  File "/usr/local/lib/python3.6/site-packages/airflow/hooks/S3_hook.py", line 542, in delete_objects
    Delete=delete_dict)
  File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (MalformedXML) when calling the DeleteObjects operation: The XML you provided was not well-formed or did not validate against our published schema

The provided patch modifies the operator behavior - if there is nothing to delete from S3 it just returns.

Tests

  • My PR adds the following unit tests

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does
    • If you implement backwards incompatible changes, please leave a note in the Updating.md so we can assign it to a appropriate release

Code Quality

  • Passes flake8

Current implementation of s3_delete_objects_operator fails
on empty list of keys. In case when list of keys to remove
comes from dynamic task (like s3_list_operator via XCom)
empty list of keys should be also considered valid.

Currently it throws an exception from boto3 (malformed XML).

The patch changes the behaviour so the operator exits
immedietely if there are no keys to remove.
@szczeles szczeles force-pushed the s3_delete_objects_empty_keys branch from c33b841 to ef3b278 Compare June 18, 2019 13:40
@OmerJog
Copy link
Contributor

OmerJog commented Jun 18, 2019

One might argue that this is the expected behavior.
The hook just expose the behavior given with boto, in my opinion we shouldnt overwrite this behavior.

Why shouldnt this be raised as PR in boto itself?

BTW one might also ask if a list was given but there was nothing to delete should it fail ot success? This should be defined by the library (in this case boto) not by Airflow.

What do you think?

@codecov-io
Copy link

codecov-io commented Jun 18, 2019

Codecov Report

Merging #5428 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5428      +/-   ##
==========================================
+ Coverage   79.09%   79.09%   +<.01%     
==========================================
  Files         485      485              
  Lines       30380    30382       +2     
==========================================
+ Hits        24030    24032       +2     
  Misses       6350     6350
Impacted Files Coverage Δ
...ow/contrib/operators/s3_delete_objects_operator.py 91.3% <100%> (+0.82%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6eb3262...ef3b278. Read the comment docs.

@ashb
Copy link
Member

ashb commented Jun 18, 2019

What we did in #4475 for a similar case was to add a parameter called not_found_mode which controls the error behaviour. I think a similar approach here would make sense.

@szczeles
Copy link
Contributor Author

@OmerJog I like your point of view!

However, I checked boto3 documentation and it looks the list of objects is marked as REQUIRED (see: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.delete_objects) so I suspect the correct library usage is to "not call the delete_objects if there is nothing to delete". An interesting fact is that In boto2 function delete_keys() works well on empty array provided.

According to your second question - in unix shell the result depends on the "force" flag, see:

~ $ mkdir /tmp/e
~ $ rm /tmp/e/file # deleting non-existing keys
rm: /tmp/e/file: No such file or directory
~ $ echo $?
1
~ $ rm -f /tmp/e/file
~ $ echo $?
0
~ $ rm # deleting empty keys
usage: rm [-f | -i] [-dPRrvW] file ...
       unlink file
~ $ echo $?
64
~ $ rm -f # deleting empty keys
~ $ echo $?
0

I just checked that the operator already works as in the "force" mode because deleting non-existing key results in task success. So currently it works in mixed-force mode: for deleting nonexisting key it works exactly as rm -f, for deleting no keys it works as rm. This PR applies "force" mode for both cases.

@ashb thank you for your suggestion, it makes 100% sense to add a flag here. Should it be named not_found_mode as in IMAP or maybe force to match syntax for unix tools that in general "remove things"?

@ashb
Copy link
Member

ashb commented Jun 19, 2019

not_found_mode isn't quite right for the empty list case, unless it also controls the behaviour when trying to delete a non-existent object. (I don't know how that currently behaves)

Force isn't quite right either - although that does stop rm complaining about non-existent files, it also makes it ignore read-only and delete anyway, which isn't something we'd do here.

In short: names are hard, I don't know. It doesn't have to match the name of the flag from the imap hook, no

@szczeles
Copy link
Contributor Author

I see. OK, let me try with PR to boto3 to check their opinion. If they consider handling empty array an improvement then the only need is to bump boto3 version in airflow codebase.

Otherwise, I'd suggest going with the current patch (so without parameters) to not complicate the simple case. This is anyway used when the operator input is fed dynamically. What do you think? @ashb @jomar83

def execute(self, context):
s3_hook = S3Hook(aws_conn_id=self.aws_conn_id, verify=self.verify)

if isinstance(self.keys, list) and len(self.keys) == 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably be simplified to:

Suggested change
if isinstance(self.keys, list) and len(self.keys) == 0:
if not self.keys:

Additionally move it above the s3_hook = S3Hook() creation -- that connects to the AWS servers, and there's no need to do that if we're going to short-circuit.

@mik-laj mik-laj added the provider:amazon AWS/Amazon - related issues label Jul 27, 2019
@szczeles
Copy link
Contributor Author

After a few experiments I implemented this using ShortCircuitOperator by skipping delete operator if list of files is empty. It's more straightforward solution and doesn't require adding any extra params in the current operators.

Therefore I'm closing this PR. Thanks for your review!

@szczeles szczeles closed this Sep 26, 2019
@szczeles szczeles deleted the s3_delete_objects_empty_keys branch September 26, 2019 20:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

provider:amazon AWS/Amazon - related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants