[AIRFLOW-6944] Allow AWS DataSync to "catch up" when Task is already …#7585

Closed
baolsen wants to merge 1 commit into apache:master from baolsen:aws_datasync_task_statuses_20200226

Conversation

@baolsen
Contributor

@baolsen baolsen commented Feb 28, 2020

…running


Issue link: AIRFLOW-6944

Make sure to mark the boxes below before creating PR:

  • Description above provides context of the change
  • Commit message/PR title starts with [AIRFLOW-NNNN]. AIRFLOW-NNNN = JIRA ID*
  • Unit tests coverage for changes (not needed for documentation changes)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

* For document-only changes commit message can start with [AIRFLOW-XXXX].


In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@boring-cyborg boring-cyborg bot added the provider:amazon AWS/Amazon - related issues label Feb 28, 2020
@codecov-io

codecov-io commented Feb 28, 2020

Codecov Report

Merging #7585 into master will decrease coverage by 0.32%.
The diff coverage is 23.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #7585      +/-   ##
==========================================
- Coverage   86.85%   86.52%   -0.33%     
==========================================
  Files         896      896              
  Lines       42622    42657      +35     
==========================================
- Hits        37021    36911     -110     
- Misses       5601     5746     +145
Impacted Files Coverage Δ
airflow/providers/amazon/aws/hooks/datasync.py 16.78% <0%> (ø) ⬆️
airflow/providers/amazon/aws/operators/datasync.py 32.63% <23.8%> (-1.57%) ⬇️
airflow/kubernetes/volume_mount.py 44.44% <0%> (-55.56%) ⬇️
airflow/kubernetes/volume.py 52.94% <0%> (-47.06%) ⬇️
airflow/kubernetes/pod_launcher.py 47.18% <0%> (-45.08%) ⬇️
...viders/cncf/kubernetes/operators/kubernetes_pod.py 69.69% <0%> (-25.26%) ⬇️
airflow/kubernetes/refresh_config.py 50.98% <0%> (-23.53%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bbfc18e...f7c5571. Read the comment docs.

@baolsen
Contributor Author

baolsen commented Mar 16, 2020

Hi @feluelle / @nuclearpinguin , would appreciate a review :)

One thing I am aware of - currently there are 2 wait loops:

  1. Wait for task to be able to be started - implemented in the DataSync Operator.
  2. Wait for task execution to be completed - implemented in the DataSync hook. (Task execution is what you get from AWS after triggering a Task to start)

What is your recommendation to standardise these?
I have no preference... I just don't want a Sensor, because the Operator is currently self-contained in terms of creating/deleting/running tasks. I'd like to move both wait loops either into the hook or the operator; I'm not sure which is typically the best place.
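To make the split concrete: both loops follow the same bounded-polling shape, just against different AWS objects. A minimal sketch (function names and status strings here are illustrative, not the real hook/operator API):

```python
import time

def poll_until(get_status, done, interval_seconds=0.0, max_iterations=10):
    """Poll get_status() until it returns a value in `done` or the budget runs out."""
    status = None
    for _ in range(max_iterations):
        status = get_status()
        if status in done:
            break
        time.sleep(interval_seconds)
    return status

# Loop 1 (currently in the operator): wait for the Task to become startable.
task_statuses = iter(['UNAVAILABLE', 'QUEUED', 'AVAILABLE'])
print(poll_until(lambda: next(task_statuses), done={'AVAILABLE'}))  # AVAILABLE

# Loop 2 (currently in the hook): wait for the TaskExecution to finish.
exec_statuses = iter(['LAUNCHING', 'TRANSFERRING', 'SUCCESS'])
print(poll_until(lambda: next(exec_statuses), done={'SUCCESS', 'ERROR'}))  # SUCCESS
```

Because the shape is identical, a shared helper like this could live in either place; the question is only where it is called from.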

@turbaszek turbaszek requested a review from feluelle March 16, 2020 12:23
def _execute_datasync_task(self):
    """Create and monitor an AWSDataSync TaskExecution for a Task."""
    hook = self.get_hook()
def _wait_get_status_before_start(
Member

Suggested change
def _wait_get_status_before_start(
def _wait_for_status(

@turbaszek
Member

  1. Wait for task to be able to be started - implemented in the DataSync Operator.
  2. Wait for task execution to be completed - implemented in the DataSync hook. (Task execution is what you get from AWS after triggering a Task to start)

What is your recommendation to standardise these?

I would say that we should not wait for task to be able to start. If something has to happen before the task we should use sensor or wait for the completion in the upstream operator. For example in case of CreateResource >> UseResource the CreateResource op should check if the resource was successfully created. This is the approach we use in GCP operators/hooks.
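The pattern described above can be sketched as follows (class and method names are hypothetical, not actual Airflow or GCP provider APIs): the creating operator verifies the resource before succeeding, so the downstream operator needs no pre-start wait loop.

```python
class FakeBackend:
    """Stand-in for a cloud API client."""
    def __init__(self):
        self.created = False
    def create(self):
        self.created = True
    def is_ready(self):
        return self.created
    def use(self):
        if not self.created:
            raise RuntimeError('resource missing')
        return 'used'

class CreateResourceOperator:
    """Creates the resource and checks it was created before succeeding."""
    def __init__(self, backend):
        self.backend = backend
    def execute(self):
        self.backend.create()
        # Verify here, in the upstream operator, instead of making the
        # downstream operator wait for a startable state.
        if not self.backend.is_ready():
            raise RuntimeError('Resource was not created successfully')

class UseResourceOperator:
    """Downstream of CreateResource >> UseResource; assumes the resource exists."""
    def __init__(self, backend):
        self.backend = backend
    def execute(self):
        return self.backend.use()

backend = FakeBackend()
CreateResourceOperator(backend).execute()
print(UseResourceOperator(backend).execute())  # used
```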

self.log.info(
    'The Task will be started because its status is in %s.',
    self.TASK_STATUS_START)
# Start the DataSync Task
Member

Suggested change
# Start the DataSync Task

Comment on lines +247 to +250
raise AirflowException(
    'Task cannot be started because its status is in %s.'
    % self.TASK_STATUS_FAIL
)
Member

Suggested change
raise AirflowException(
    'Task cannot be started because its status is in %s.'
    % self.TASK_STATUS_FAIL
)
raise AirflowException(
    f'Task cannot be started because its status is in {self.TASK_STATUS_FAIL}.'
)

Please use f-strings :)

        % self.TASK_STATUS_FAIL
    )
else:
    raise AirflowException('Unexpected task status %s.' % self.task_status)
Member

Suggested change
raise AirflowException('Unexpected task status %s.' % self.task_status)
raise AirflowException(f'Unexpected task status {self.task_status}')

Comment on lines +385 to +403
hook = self.get_hook()
task_status = hook.get_task_description(self.task_arn)['Status']
iteration = 0
while task_status in self.TASK_STATUS_WAIT_BEFORE_START:
    self.log.info(
        'Task status is %s.'
        ' Waiting for it to not be %s.'
        ' Iteration %s/%s.',
        task_status,
        self.TASK_STATUS_WAIT_BEFORE_START,
        iteration,
        max_iterations)
    time.sleep(self.wait_interval_seconds)
    task_status = hook.get_task_description(self.task_arn)['Status']
    iteration = iteration + 1
    if iteration >= max_iterations:
        break

return task_status
Member

Suggested change
hook = self.get_hook()
task_status = hook.get_task_description(self.task_arn)['Status']
iteration = 0
while task_status in self.TASK_STATUS_WAIT_BEFORE_START:
    self.log.info(
        'Task status is %s.'
        ' Waiting for it to not be %s.'
        ' Iteration %s/%s.',
        task_status,
        self.TASK_STATUS_WAIT_BEFORE_START,
        iteration,
        max_iterations)
    time.sleep(self.wait_interval_seconds)
    task_status = hook.get_task_description(self.task_arn)['Status']
    iteration = iteration + 1
    if iteration >= max_iterations:
        break
return task_status

hook = self.get_hook()
for iteration in range(max_iterations):
    task_status = hook.get_task_description(self.task_arn)['Status']
    self.log.info(
        'Task status is %s.'
        ' Waiting for it to not be %s.'
        ' Iteration %s/%s.',
        task_status,
        self.TASK_STATUS_WAIT_BEFORE_START,
        iteration,
        max_iterations)
    if task_status not in self.TASK_STATUS_WAIT_BEFORE_START:
        break
    time.sleep(self.wait_interval_seconds)
return task_status

WDYT?
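One practical difference worth noting between the two versions: with a status that never leaves the waiting set, the while-based loop issues one more get_task_description call than the for-based loop. A quick check with a counting stub hook (the stub class and helper names here are made up for illustration):

```python
import time

class CountingHook:
    """Stub that always reports a waiting status and counts API calls."""
    def __init__(self):
        self.calls = 0
    def get_task_description(self, task_arn):
        self.calls += 1
        return {'Status': 'QUEUED'}

WAITING = {'QUEUED'}

def wait_while_version(hook, max_iterations):
    # Mirrors the original: initial call, then call again inside the loop.
    task_status = hook.get_task_description('arn')['Status']
    iteration = 0
    while task_status in WAITING:
        time.sleep(0)
        task_status = hook.get_task_description('arn')['Status']
        iteration += 1
        if iteration >= max_iterations:
            break
    return task_status

def wait_for_version(hook, max_iterations):
    # Mirrors the suggested change: one call per bounded iteration.
    task_status = None
    for _ in range(max_iterations):
        task_status = hook.get_task_description('arn')['Status']
        if task_status not in WAITING:
            break
        time.sleep(0)
    return task_status

h1, h2 = CountingHook(), CountingHook()
wait_while_version(h1, max_iterations=3)
wait_for_version(h2, max_iterations=3)
print(h1.calls, h2.calls)  # 4 3
```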

hook = self.get_hook()
def _wait_get_status_before_start(
        self,
        max_iterations=12 * 180):  # wait_interval_seconds*12*180=180 minutes by default
Member

Suggested change
max_iterations=12 * 180): # wait_interval_seconds*12*180=180 minutes by default
max_iterations : int = 12 * 180) -> str:

You already have that comment in the docs below. Even better would be to add a :param entry and move the comment there.

The Task can be started when its Status is not in TASK_STATUS_WAIT_BEFORE_START
Uses wait_interval_seconds (which is also used while waiting for TaskExecution)
So, max_iterations=12*180 gives 180 minutes wait by default.
"""
Member

Please also add :returns: here.

@feluelle
Member

feluelle commented Mar 21, 2020

  1. Wait for task to be able to be started - implemented in the DataSync Operator.
  2. Wait for task execution to be completed - implemented in the DataSync hook. (Task execution is what you get from AWS after triggering a Task to start)

What is your recommendation to standardise these?

I would say that we should not wait for task to be able to start. If something has to happen before the task we should use sensor or wait for the completion in the upstream operator. For example in case of CreateResource >> UseResource the CreateResource op should check if the resource was successfully created. This is the approach we use in GCP operators/hooks.

I think both options are useful to have. In #7410 (Tableau Integration) I added a blocking parameter which, if set to True, enables a sensor task. If set to False it does not wait (i.e. async).
See: https://github.com/apache/airflow/pull/7410/files#diff-4f65033904d8c2b6c064e959d218d513R71
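The blocking-parameter pattern described above can be sketched like this (class names and the polling details are hypothetical, not the actual #7410 code):

```python
class FakeJobClient:
    """Stub client whose job finishes after three polls."""
    def __init__(self):
        self.polls = 0
    def submit(self):
        return 'job-1'
    def is_done(self, job_id):
        self.polls += 1
        return self.polls >= 3

class SubmitJobOperator:
    """blocking=True waits for completion; blocking=False returns immediately."""
    def __init__(self, client, blocking=True):
        self.client = client
        self.blocking = blocking
    def execute(self):
        job_id = self.client.submit()
        if self.blocking:
            # Synchronous path: poll until done (real code would sleep between polls).
            while not self.client.is_done(job_id):
                pass
        # Async path: a downstream sensor can wait on job_id instead.
        return job_id

sync_client = FakeJobClient()
print(SubmitJobOperator(sync_client, blocking=True).execute())   # job-1
print(sync_client.polls)  # 3

async_client = FakeJobClient()
print(SubmitJobOperator(async_client, blocking=False).execute())  # job-1
print(async_client.polls)  # 0
```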

@feluelle
Member

feluelle commented Apr 8, 2020

@baolsen gentle ping :)

@stale

stale bot commented May 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label May 23, 2020
@stale stale bot closed this May 30, 2020
