
Manual dagrun logs to last scheduled dagruns log files #19058

Closed
jedcunningham opened this issue Oct 18, 2021 · 12 comments
Labels: affected_version:2.2 (Issues Reported for 2.2), area:core, kind:bug (This is clearly a bug)

Comments


Apache Airflow version

2.2.0 (latest released)

Operating System

MacOS 11.6

Versions of Apache Airflow Providers

No response

Deployment

Virtualenv installation

Deployment details

No response

What happened

Task logs for manual runs are being appended to the last scheduled run's log file.

It seems like it isn't using the logical_date (aka execution_date) to determine the log file path (though I haven't dug in at all).

What you expected to happen

Each dagrun to have its own, unique log file.

How to reproduce

If you enable this DAG:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "manual_run_execution_date",
    start_date=datetime(2021, 10, 14),
    schedule_interval="0 0 * * *",
) as dag:
    test = BashOperator(task_id="test", bash_command="echo HI")

  1. Once the scheduled runs finish, bring up the log for the last scheduled run (and keep the tab open).
  2. Now trigger a manual run via the UI, wait for it to finish, then look at the log for the manual run.
  3. Note that you'll see two runs in the log: one from the scheduled run and one from the manual run.
  4. Confirm against the last scheduled run's tab you kept open.
  5. Note that at the top of both, you'll see they're pulling from the same log file:

*** Reading local file: /opt/airflow/logs/manual_run_execution_date/test/2021-10-17T00:00:00+00:00/1.log

Anything else

This is a regression in 2.2.0; I wasn't able to reproduce it in 2.1.4.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

jedcunningham added the kind:bug, area:core and affected_version:2.2 labels on Oct 18, 2021
uranusjr (Member) commented Oct 19, 2021

OK, I think I know what's going on. The log handler uses a template to render the log's filename, which by default is {{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log. The problem is the {{ ts }} part. Prior to AIP-39, this was simply the ISO 8601 rendition of execution_date, but AIP-39 decoupled a DAG run's identifying timestamp (logical_date) from its operating period (data_interval) and split execution_date's semantics in two. This meant we needed to decide which semantic the "derived" variables, such as ts and ts_nodash, should pick up. We chose data_interval.start in 2.2.0 because we guessed that's what most people would want.
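
For context, this template is the log_filename_template option, which lives in the logging section of airflow.cfg in 2.x (the excerpt below is illustrative; deployments may override it):

[logging]
# Default in 2.2.0; the {{ ts }} segment is what resolves identically
# for a manual run and the last scheduled run.
log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log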

Another change we made when we decoupled logical_date from data_interval was to "fix" manual DAG runs' data intervals. Prior to AIP-39, since a DAG run's operating period was inferred from execution_date, a manual DAG run's data interval was nonsensical: execution_date is set to when the run is triggered, which doesn't have a logical end time at all. So 2.2.0 introduced new logic to "align" a manual run's data interval to the most recent completed schedule, while keeping its logical_date as the same value execution_date had previously. But this introduces a problem for log file identification with ts, as shown here.
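
A minimal sketch of the collision, using plain Jinja2 with hand-picked values rather than Airflow's actual rendering code:

from jinja2 import Template
import pendulum

template = Template("{{ dag_id }}/{{ task_id }}/{{ ts }}/{{ try_number }}.log")

# Last scheduled run: data interval 2021-10-17 to 2021-10-18, so in 2.2.0
# ts renders from data_interval_start, i.e. 2021-10-17T00:00:00+00:00.
scheduled_ts = pendulum.datetime(2021, 10, 17, tz="UTC").isoformat()

# Manual run triggered during the day on 2021-10-18: its data interval is aligned
# back to the most recent completed schedule, again 2021-10-17 to 2021-10-18,
# so ts renders to exactly the same value.
manual_ts = pendulum.datetime(2021, 10, 17, tz="UTC").isoformat()

for ts in (scheduled_ts, manual_ts):
    print(template.render(
        dag_id="manual_run_execution_date", task_id="test", ts=ts, try_number=1,
    ))
# Both lines print manual_run_execution_date/test/2021-10-17T00:00:00+00:00/1.log,
# which is why the manual run appends to the scheduled run's log file.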

So the easiest way out here would be to change the default log filename template to use {{ logical_date | ts }} instead of {{ ts }}. But that would still leave any user-specified custom log_filename_template configuration broken and needing migration, which does not sound viable (and is compatibility-breaking). Therefore, I think the only viable fix is to roll back the semantic change we made to ts, ts_nodash, etc. so they again reflect execution_date, i.e. logical_date. This is quite unfortunate, since it goes against our preference to make logical_date less prominent in the interface (to nudge people toward the modern data-interval-based semantics), but it's probably the only reasonable approach.

uranusjr added this to the Airflow 2.2.1 milestone on Oct 19, 2021
potiuk (Member) commented Oct 19, 2021

Great analysis @uranusjr, chapeau bas! I think there is another option for custom logs: I believe we could simply ban using ts in log filenames. That would mean some additional analysis of the custom logging specification, and failing Airflow startup if we detect it, possibly with a very descriptive explanation of how to convert the old logging spec, and maybe even a tool to help with it.

Yes, it means breaking backwards compatibility (somewhat), but I think this is the kind of change we can claim is a fix to behaviour that was previously only coincidentally working fine.

I think that when it comes to 'custom' logging, we do not have an API clearly specifying what is and what is not breaking compatibility.
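
A rough sketch of what such a startup check could look like (the function and error message here are hypothetical, not an existing Airflow API):

import re

from airflow.configuration import conf

def validate_log_filename_template():
    """Fail fast if the configured log filename template still relies on ts-style variables."""
    template = conf.get("logging", "log_filename_template")
    if re.search(r"\{\{\s*ts(_nodash(_with_tz)?)?\s*\}\}", template):
        raise ValueError(
            "log_filename_template uses {{ ts }}, which no longer uniquely identifies "
            "a DAG run; use e.g. {{ run_id }} or {{ logical_date | ts }} instead and "
            "migrate existing log files."
        )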

uranusjr (Member) commented:

> I believe we could simply ban using ts in log filenames. That would mean some additional analysis of the custom logging specification, and failing Airflow startup if we detect it, possibly with a very descriptive explanation of how to convert the old logging spec, and maybe even a tool to help with it.

The problem with this is that the /log page (plus the corresponding API endpoint) currently derives log filenames implicitly at runtime from log_filename_template. So if we change the value of this option, all previous logs would no longer be accessible, since Airflow won't be able to derive the old filenames. So I think this change needs to be eased in. My current idea is to:

  1. In 2.2.1, change ts et al. back to their pre-2.2.0 meaning so everything works for now. To prepare for the option change, we add a new column log_filename on TaskInstance that's populated exactly once when the TI is run, to decouple the filename derivation logic.
  2. In 2.3.0, deprecate ts et al. Change the default log_filename_template to use run_id instead (which should be unique; see the sketch after this list). Since log filenames are now rendered eagerly, old logs can still be located by their "old" filenames and won't be affected.
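
For illustration, a run_id-based default along the lines of step 2 might look something like this (a sketch only; the exact template that ships in 2.3 may differ):

[logging]
log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ ti.run_id }}/{{ try_number }}.log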

uranusjr (Member) commented:

> ...add a new column log_filename on TaskInstance that's populated exactly once when the TI is run, to decouple the filename derivation logic.

But considering how poor the performance was when we added a new field on TaskInstance, this might not be a good idea, since we'd need to backfill log_filename values for all existing TI rows. A workaround is to add this to DagRun instead, but even then, God only knows whether the data migration performance would be acceptable. Another thing to benchmark, I guess.
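
For reference, the column addition itself is cheap in an Alembic migration; the backfill is the potentially expensive part (a sketch with hypothetical revision ids, assuming the column goes on task_instance):

from alembic import op
import sqlalchemy as sa

revision = "aaaa0000aaaa"       # hypothetical revision id
down_revision = "bbbb1111bbbb"  # hypothetical parent revision

def upgrade():
    # Adding a nullable column is close to a metadata-only change on most databases.
    op.add_column("task_instance", sa.Column("log_filename", sa.String(length=2000), nullable=True))
    # A blanket backfill like the statement below is the part that could be slow on
    # large task_instance tables, hence the discussion about doing it lazily instead:
    # op.execute("UPDATE task_instance SET log_filename = ...")

def downgrade():
    op.drop_column("task_instance", "log_filename")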

ashb (Member) commented Oct 20, 2021

> But considering how poor the performance was when we added a new field on TaskInstance, this might not be a good idea, since we'd need to backfill log_filename values for all existing TI rows. A workaround is to add this to DagRun instead, but even then, God only knows whether the data migration performance would be acceptable. Another thing to benchmark, I guess.

The poor performance was caused by changing the PK, not just adding a column.

If we add a nullable column it should be near O(1).

Oh, backfilling too. Yeah, that could be expensive. We could backfill only on demand? I.e., on first access, if the model doesn't have it set, render it and set it in the db?
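
A minimal sketch of that on-demand idea, assuming a hypothetical log_filename column and a hypothetical render_log_filename helper (neither is an existing Airflow API):

def get_log_filename(ti, session, render_log_filename):
    """Return the stored log filename, rendering and persisting it on first access."""
    if ti.log_filename is None:
        # Render with whatever template is currently in effect, then persist the
        # result so it survives later changes to log_filename_template.
        ti.log_filename = render_log_filename(ti)
        session.merge(ti)
        session.commit()
    return ti.log_filename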

jedcunningham (Member, Author) commented:

Sounds good to me. Don't forget that try_number will probably also need to be rendered on demand.

jedcunningham removed this from the Airflow 2.2.1 milestone on Oct 20, 2021
uranusjr (Member) commented:

The problem is that we need to finish all backfilling before we change the default filename format (or before the user changes their custom format based on our recommendation); otherwise the old format would be lost forever and we won't be able to backfill anymore. And there's no guarantee that each past task instance (or DAG run) would be visited at least once before the change unless we artificially add something.

One workaround I can think of is to instead add a table in the database that records the current (i.e. old, after we change the default) format. When we implement the filename format change, we'll insert a single row into that table for TIs with a NULL format to use as the fallback. Not particularly elegant, though.
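
A rough sketch of what such a table could look like (hypothetical schema; the names are made up for illustration):

import sqlalchemy as sa

metadata = sa.MetaData()

# Each row records a log_filename_template value that was in effect at some point,
# so log files rendered with an earlier template can still be located later.
log_template = sa.Table(
    "log_template",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True, autoincrement=True),
    sa.Column("filename_template", sa.Text, nullable=False),
    sa.Column("created_at", sa.TIMESTAMP(timezone=True), nullable=False, server_default=sa.func.now()),
)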

calfzhou (Contributor) commented Nov 2, 2021

What's the difference between logical_date and data_interval_start?

uranusjr (Member) commented Nov 2, 2021

data_interval_start is the start of the DAG run's data interval, while logical_date is a date that identifies that DAG run. They are the same for runs scheduled via cron and timedelta, but not for manual runs and @once runs. A custom timetable may also choose not to use the same value for both.
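
To make that concrete with the daily DAG from the reproduction above (illustrative timestamps):

Last scheduled run (interval 2021-10-17 to 2021-10-18, executed after midnight on the 18th):
    data_interval_start = 2021-10-17T00:00:00+00:00
    logical_date        = 2021-10-17T00:00:00+00:00   (same as data_interval_start)

Manual run triggered at, say, 2021-10-18T15:30:00+00:00:
    data_interval_start = 2021-10-17T00:00:00+00:00   (aligned to the most recent completed interval)
    logical_date        = 2021-10-18T15:30:00+00:00   (the trigger time)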

uranusjr (Member) commented:

Tracking the rest of the work in #19625.

iorilan commented Nov 26, 2021

Hi @uranusjr, this issue still seems reproducible when manually triggering the DAG run.
For now I'm using {{ data_interval_start }} in place of {{ ts }} to make it work.
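
If you want to try that kind of override, it would go in the logging section of airflow.cfg (illustrative only; note that data_interval_start renders as a plain datetime string rather than the ISO form ts used, and whether it avoids collisions depends on your Airflow version and schedule):

[logging]
log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ data_interval_start }}/{{ try_number }}.log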

eliot1785 commented:

I found this when reading the changelogs. I just wanted to make a comment on ds/ts/etc.: personally, I use these variables for things like determining the output names of files, like myoutput_[ds]_[ts] (similar to how Airflow determines the log file names, I guess). I'm probably not alone in that respect.

So I'm glad that the change was reverted, because otherwise when I had a manual run it sounds like I would have overwritten my output for my previous run, rather than creating new output files.

I haven't yet learned about the data_interval feature, but I just wanted to say as a user that it would be pretty important to keep the behavior of ds/ts/etc stable - changing them would basically be a breaking change. Perhaps new context variables could be added for data_interval instead, leaving it up to the user which one they want to use.
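
For illustration, the kind of pattern being described, as a hypothetical DAG with a made-up output path:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "ds_ts_output_names",
    start_date=datetime(2021, 10, 14),
    schedule_interval="0 0 * * *",
) as dag:
    # The output path embeds ds and ts_nodash; if ts stopped being unique per run,
    # a manual run could overwrite the previous scheduled run's output file.
    export = BashOperator(
        task_id="export",
        bash_command="echo data > /tmp/myoutput_{{ ds }}_{{ ts_nodash }}.csv",
    )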
