[AIRFLOW-3153] send dag last_run to statsd #3997
feng-tao wants to merge 1 commit into apache:master from
PTAL @kaxil
ashb left a comment
Stats are good, so I'm not going to complain about more stats, but doesn't Airflow's built-in SLA do some of this already?
last_run = processor_manager.get_last_finish_time(file_path)

file_name = file_path[len(dags_folder) + 1:]
dag_name = os.path.splitext(file_name)[0].replace(os.sep, '.')
A dag file could contain multiple DAGs - is there a reason to not use dag_id here?
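To illustrate the concern, here is a minimal sketch of the name derivation from the diff above, wrapped in a hypothetical helper (`metric_name_for_file` is not part of the PR, and the paths are made up). Because the name is computed from the file path alone, every DAG defined in the same file would end up under one metric name.

```python
import os


def metric_name_for_file(file_path, dags_folder):
    """Derive a dotted metric name from a DAG file path, mirroring the diff.

    Hypothetical helper for illustration; not part of the PR.
    """
    # Drop the dags_folder prefix plus the path separator that follows it.
    file_name = file_path[len(dags_folder) + 1:]
    # Strip the extension and turn path separators into dots.
    return os.path.splitext(file_name)[0].replace(os.sep, '.')


# Made-up paths, built with os.path.join so the example is platform independent.
dags_folder = os.path.join(os.sep, 'opt', 'airflow', 'dags')
file_path = os.path.join(dags_folder, 'team_a', 'etl.py')

print(metric_name_for_file(file_path, dags_folder))  # team_a.etl
```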
unixtime = last_run.strftime("%s")
seconds_ago = (timezone.utcnow() - last_run).total_seconds()
Stats.gauge('last_run.unixtime.{}'.format(dag_name), unixtime)
Stats.gauge('last_run.seconds_ago.{}'.format(dag_name), seconds_ago)
How often does this stat get updated? (I'm not sure from reviewing the PR where we are in the scheduler.) If we were using Prometheus I would be tempted to say that the last_run.unixtime stat is the only one we should have, but I honestly don't remember how StatsD works anymore.
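For reference, a minimal sketch of the two values being emitted. It assumes a stand-in `FakeStats` recorder in place of Airflow's `Stats` client and a hypothetical `emit_last_run` helper, and the timestamps are arbitrary. It uses `datetime.timestamp()` rather than `strftime("%s")`, because `%s` is not among the format codes Python documents and its behavior depends on the platform. Note that `seconds_ago` is just the current time minus `unixtime`, so it could also be derived at query time.

```python
from datetime import datetime, timedelta, timezone


class FakeStats:
    """Stand-in recorder (assumption); a real client would send to StatsD."""

    def __init__(self):
        self.gauges = {}

    def gauge(self, name, value):
        self.gauges[name] = value


def emit_last_run(stats, dag_name, last_run, now):
    """Record the last finish time as an absolute and a relative gauge."""
    # Absolute value: seconds since the epoch for a timezone-aware datetime.
    unixtime = last_run.timestamp()
    # Relative value: how long ago the file was last processed.
    seconds_ago = (now - last_run).total_seconds()
    stats.gauge('last_run.unixtime.{}'.format(dag_name), unixtime)
    stats.gauge('last_run.seconds_ago.{}'.format(dag_name), seconds_ago)


stats = FakeStats()
now = datetime(2018, 10, 4, 12, 0, 0, tzinfo=timezone.utc)  # arbitrary instant
emit_last_run(stats, 'team_a.etl', now - timedelta(seconds=90), now)
print(stats.gauges)
```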
Tests seem to be failing :'(
Rebasing on to latest master should fix that.

Can we please add this to the docs as well, listing all the stats and what they mean?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

curious why this was implemented as a gauge and not a timer. |
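For context on the distinction, here is a small sketch of how the two metric types look in the StatsD line format (`name:value|g` for a gauge, `name:value|ms` for a timer). The `format_metric` helper and both metric names are hypothetical. One possible reading is that a timer suits a duration such as per-file processing time, while a last-run value is a point-in-time reading, which is what a gauge reports.

```python
def format_metric(name, value, metric_type):
    """Format one metric in the StatsD line format: <name>:<value>|<type>."""
    # Type codes used by StatsD: 'g' for gauges, 'ms' for timers.
    type_codes = {'gauge': 'g', 'timer': 'ms'}
    return '{}:{}|{}'.format(name, value, type_codes[metric_type])


# Hypothetical metric names, for illustration only.
print(format_metric('last_run.seconds_ago.team_a.etl', 90, 'gauge'))
print(format_metric('processing_time.team_a.etl', 320, 'timer'))
```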
Make sure you have checked all steps below.
Jira
Description
Lyft has been running with this PR for over a year, and the stats have caught numerous production issues (e.g., by setting PagerDuty alerts on the last run time when it exceeds a certain threshold).
This PR adds StatsD logging for DAG generation in Airflow, recording:
the time spent processing each file; and
the last time it was processed (both as a Unix timestamp and as an interval in seconds).
Credit to the original PR owner (@betodealmeida) at Lyft.
Also fixes some flake8 errors.
Tests
Adds stats only; no tests needed.
Commits
Documentation
Code Quality
git diff upstream/master -u -- "*.py" | flake8 --diff