
Manual dagrun logs to last scheduled dagruns log files #19058

Closed
jedcunningham opened this issue Oct 18, 2021 · 12 comments
Labels: affected_version:2.2 (Issues Reported for 2.2), area:core, kind:bug (This is clearly a bug)

Comments


Apache Airflow version

2.2.0 (latest released)

Operating System

MacOS 11.6

Versions of Apache Airflow Providers

No response

Deployment

Virtualenv installation

Deployment details

No response

What happened

Task logs for manual runs are being appended to the last scheduled run's log file.

It seems like it isn't using the logical_date (aka execution_date) to determine the log file path (though I haven't dug in at all).

What you expected to happen

Each dagrun to have its own, unique log file.

How to reproduce

If you enable this DAG:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "manual_run_execution_date",
    start_date=datetime(2021, 10, 14),
    schedule_interval="0 0 * * *",
) as dag:
    test = BashOperator(task_id="test", bash_command="echo HI")

  1. Once the scheduled runs finish, bring up the log for the last scheduled run (and keep the tab open).
  2. Now trigger a manual run via the UI, wait for it to finish, then look at the log for the manual run.
  3. Note that you'll see two runs in the log: one from the scheduled run and one from the manual run.
  4. Confirm against the last scheduled run's tab you kept open.
  5. Note that at the top of both, you'll see they're pulling from the same log file:

*** Reading local file: /opt/airflow/logs/manual_run_execution_date/test/2021-10-17T00:00:00+00:00/1.log

Anything else

This is a regression in 2.2.0; I wasn't able to reproduce it in 2.1.4.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

jedcunningham added the kind:bug, area:core and affected_version:2.2 labels on Oct 18, 2021
uranusjr (Member) commented Oct 19, 2021

OK, I think I know what's going on. The log handler uses a template to render the log's filename, which by default is {{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log. The problem is the {{ ts }} part. Prior to AIP-39, this was simply the ISO 8601 rendition of execution_date, but AIP-39 decoupled a DAG run's identifying timestamp (logical_date) from its operating period (data_interval) and split execution_date's semantics in two. This meant we needed to decide which semantic the "derived" variables, such as ts and ts_nodash, should pick up. We chose data_interval.start in 2.2.0 because we guessed that's what most people would want.
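
For context, this template is the log_filename_template option, which lives in the logging section of airflow.cfg in 2.x (the excerpt below is illustrative; deployments may override it):

[logging]
# Default in 2.2.0; the {{ ts }} segment is what resolves identically
# for a manual run and the last scheduled run.
log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log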

Another change we made when we decoupled logical_date from data_interval was to "fix" manual DAG runs' data intervals. Prior to AIP-39, since a DAG run's operating period was inferred from execution_date, a manual DAG run's data interval was nonsensical: execution_date is set to when the run is triggered, which doesn't have a logical end time at all. So 2.2.0 introduced new logic to "align" a manual run's data interval to the most recent completed schedule, while keeping its logical_date as the same value execution_date had previously. But this introduces a problem for log file identification with ts, as shown here.
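
A minimal sketch of the collision, using plain Jinja2 with hand-picked values rather than Airflow's actual rendering code:

from jinja2 import Template
import pendulum

template = Template("{{ dag_id }}/{{ task_id }}/{{ ts }}/{{ try_number }}.log")

# Last scheduled run: data interval 2021-10-17 to 2021-10-18, so in 2.2.0
# ts renders from data_interval_start, i.e. 2021-10-17T00:00:00+00:00.
scheduled_ts = pendulum.datetime(2021, 10, 17, tz="UTC").isoformat()

# Manual run triggered during the day on 2021-10-18: its data interval is aligned
# back to the most recent completed schedule, again 2021-10-17 to 2021-10-18,
# so ts renders to exactly the same value.
manual_ts = pendulum.datetime(2021, 10, 17, tz="UTC").isoformat()

for ts in (scheduled_ts, manual_ts):
    print(template.render(
        dag_id="manual_run_execution_date", task_id="test", ts=ts, try_number=1,
    ))
# Both lines print manual_run_execution_date/test/2021-10-17T00:00:00+00:00/1.log,
# which is why the manual run appends to the scheduled run's log file.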

So the easiest way out here would be to change the default log filename template to use {{ logical_date | ts }} instead of {{ ts }}. But that would still leave any user-specified custom log_filename_template configuration broken and needing migration, which does not sound viable (and is compatibility-breaking). Therefore, I think the only viable fix is to roll back the semantic change we made to ts, ts_nodash, etc. so they again reflect execution_date, i.e. logical_date. This is quite unfortunate, since it goes against our preference to make logical_date less prominent in the interface (to nudge people toward the modern data-interval-based semantics), but it's probably the only reasonable approach.

uranusjr added this to the Airflow 2.2.1 milestone on Oct 19, 2021
potiuk (Member) commented Oct 19, 2021

Great analysis @uranusjr, chapeau bas! I think there is another option for custom logs: I believe we could simply ban using ts in log filenames. That would mean some additional analysis of the custom logging specification, and failing Airflow startup if we detect it, possibly with a very descriptive explanation of how to convert the old logging spec, and maybe even a tool to help with it.

Yes, it means breaking backwards compatibility (somewhat), but I think this is the kind of change we can claim is a fix to behaviour that was previously only coincidentally working fine.

I think that when it comes to 'custom' logging, we do not have an API clearly specifying what is and what is not breaking compatibility.
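
A rough sketch of what such a startup check could look like (the function and error message here are hypothetical, not an existing Airflow API):

import re

from airflow.configuration import conf

def validate_log_filename_template():
    """Fail fast if the configured log filename template still relies on ts-style variables."""
    template = conf.get("logging", "log_filename_template")
    if re.search(r"\{\{\s*ts(_nodash(_with_tz)?)?\s*\}\}", template):
        raise ValueError(
            "log_filename_template uses {{ ts }}, which no longer uniquely identifies "
            "a DAG run; use e.g. {{ run_id }} or {{ logical_date | ts }} instead and "
            "migrate existing log files."
        )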

uranusjr (Member) commented:

> I believe we could simply ban using ts in log filenames. That would mean some additional analysis of the custom logging specification, and failing Airflow startup if we detect it, possibly with a very descriptive explanation of how to convert the old logging spec, and maybe even a tool to help with it.

The problem with this is that the /log page (plus the corresponding API endpoint) currently derives log filenames implicitly at runtime from log_filename_template. So if we change the value of this option, all previous logs would no longer be accessible, since Airflow won't be able to derive the old filenames. So I think this change needs to be eased in. My current idea is to:

  1. In 2.2.1, change ts et al. back to their pre-2.2.0 meaning so everything works for now. To prepare for the option change, we add a new column log_filename on TaskInstance that's populated exactly once when the TI is run, to decouple the filename derivation logic.
  2. In 2.3.0, deprecate ts et al. Change the default log_filename_template to use run_id instead (which should be unique; see the sketch after this list). Since log filenames are now rendered eagerly, old logs can still be located by their "old" filenames and won't be affected.
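
For illustration, a run_id-based default along the lines of step 2 might look something like this (a sketch only; the exact template that ships in 2.3 may differ):

[logging]
log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ ti.run_id }}/{{ try_number }}.log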

uranusjr (Member) commented:

> ...add a new column log_filename on TaskInstance that's populated exactly once when the TI is run, to decouple the filename derivation logic.

But considering how poor the performance was when we added a new field on TaskInstance, this might not be a good idea, since we'd need to backfill log_filename values for all existing TI rows. A workaround is to add this to DagRun instead, but even then, God only knows whether the data migration performance would be acceptable. Another thing to benchmark, I guess.
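
For reference, the column addition itself is cheap in an Alembic migration; the backfill is the potentially expensive part (a sketch with hypothetical revision ids, assuming the column goes on task_instance):

from alembic import op
import sqlalchemy as sa

revision = "aaaa0000aaaa"       # hypothetical revision id
down_revision = "bbbb1111bbbb"  # hypothetical parent revision

def upgrade():
    # Adding a nullable column is close to a metadata-only change on most databases.
    op.add_column("task_instance", sa.Column("log_filename", sa.String(length=2000), nullable=True))
    # A blanket backfill like the statement below is the part that could be slow on
    # large task_instance tables, hence the discussion about doing it lazily instead:
    # op.execute("UPDATE task_instance SET log_filename = ...")

def downgrade():
    op.drop_column("task_instance", "log_filename")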

ashb (Member) commented Oct 20, 2021

> But considering how poor the performance was when we added a new field on TaskInstance, this might not be a good idea, since we'd need to backfill log_filename values for all existing TI rows. A workaround is to add this to DagRun instead, but even then, God only knows whether the data migration performance would be acceptable. Another thing to benchmark, I guess.

The poor performance was caused by changing the PK, not just adding a column.

If we add a nullable column it should be near O(1).

Oh, backfilling too. Yeah, that could be expensive. We could backfill only on demand? I.e., on first access, if the model doesn't have it set, render it and set it in the db?
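
A minimal sketch of that on-demand idea, assuming a hypothetical log_filename column and a hypothetical render_log_filename helper (neither is an existing Airflow API):

def get_log_filename(ti, session, render_log_filename):
    """Return the stored log filename, rendering and persisting it on first access."""
    if ti.log_filename is None:
        # Render with whatever template is currently in effect, then persist the
        # result so it survives later changes to log_filename_template.
        ti.log_filename = render_log_filename(ti)
        session.merge(ti)
        session.commit()
    return ti.log_filename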

jedcunningham (Member, Author) commented:

Sounds good to me. Don't forget that try_number will probably also need to be rendered on demand.

jedcunningham removed this from the Airflow 2.2.1 milestone on Oct 20, 2021
uranusjr (Member) commented:

The problem is that we need to finish all backfilling before we change the default filename format (or before the user changes their custom format based on our recommendation); otherwise the old format would be lost forever and we won't be able to backfill anymore. And there's no guarantee that each past task instance (or DAG run) would be visited at least once before the change unless we artificially add something.

One workaround I can think of is to instead add a table in the database that records the current (i.e. old, after we change the default) format. When we implement the filename format change, we'll insert a single row into that table for TIs with a NULL format to use as the fallback. Not particularly elegant, though.
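
A rough sketch of what such a table could look like (hypothetical schema; the names are made up for illustration):

import sqlalchemy as sa

metadata = sa.MetaData()

# Each row records a log_filename_template value that was in effect at some point,
# so log files rendered with an earlier template can still be located later.
log_template = sa.Table(
    "log_template",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True, autoincrement=True),
    sa.Column("filename_template", sa.Text, nullable=False),
    sa.Column("created_at", sa.TIMESTAMP(timezone=True), nullable=False, server_default=sa.func.now()),
)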

calfzhou (Contributor) commented Nov 2, 2021

What's the difference between logical_date and data_interval_start?

uranusjr (Member) commented Nov 2, 2021

data_interval_start is the start of the DAG run's data interval, while logical_date is a date that identifies that DAG run. They are the same for runs scheduled via cron and timedelta, but not for manual runs and @once runs. A custom timetable may also choose not to use the same value for both.
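
To make that concrete with the daily DAG from the reproduction above (illustrative timestamps):

Last scheduled run (interval 2021-10-17 to 2021-10-18, executed after midnight on the 18th):
    data_interval_start = 2021-10-17T00:00:00+00:00
    logical_date        = 2021-10-17T00:00:00+00:00   (same as data_interval_start)

Manual run triggered at, say, 2021-10-18T15:30:00+00:00:
    data_interval_start = 2021-10-17T00:00:00+00:00   (aligned to the most recent completed interval)
    logical_date        = 2021-10-18T15:30:00+00:00   (the trigger time)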

uranusjr (Member) commented:

Tracking the rest of the work in #19625.

iorilan commented Nov 26, 2021

Hi @uranusjr, this issue still seems reproducible when manually triggering the DAG run.
For now I'm using {{ data_interval_start }} in place of {{ ts }} to make it work.
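
If you want to try that kind of override, it would go in the logging section of airflow.cfg (illustrative only; note that data_interval_start renders as a plain datetime string rather than the ISO form ts used, and whether it avoids collisions depends on your Airflow version and schedule):

[logging]
log_filename_template = {{ ti.dag_id }}/{{ ti.task_id }}/{{ data_interval_start }}/{{ try_number }}.log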

eliot1785 commented:

I found this when reading the changelogs. I just wanted to make a comment on ds/ts/etc.: personally, I use these variables for things like determining the output names of files, like myoutput_[ds]_[ts] (similar to how Airflow determines the log file names, I guess). I'm probably not alone in that respect.

So I'm glad that the change was reverted, because otherwise when I had a manual run it sounds like I would have overwritten my output for my previous run, rather than creating new output files.

I haven't yet learned about the data_interval feature, but I just wanted to say as a user that it would be pretty important to keep the behavior of ds/ts/etc stable - changing them would basically be a breaking change. Perhaps new context variables could be added for data_interval instead, leaving it up to the user which one they want to use.
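
For illustration, the kind of pattern being described, as a hypothetical DAG with a made-up output path:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    "ds_ts_output_names",
    start_date=datetime(2021, 10, 14),
    schedule_interval="0 0 * * *",
) as dag:
    # The output path embeds ds and ts_nodash; if ts stopped being unique per run,
    # a manual run could overwrite the previous scheduled run's output file.
    export = BashOperator(
        task_id="export",
        bash_command="echo data > /tmp/myoutput_{{ ds }}_{{ ts_nodash }}.csv",
    )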
