Skip to content

Fetch log file from multiple worker#21230

Closed
avkirilishin wants to merge 1 commit intoapache:mainfrom
avkirilishin:fetch_log_files_from_multiple_hosts
Closed

Fetch log file from multiple worker#21230
avkirilishin wants to merge 1 commit intoapache:mainfrom
avkirilishin:fetch_log_files_from_multiple_hosts

Conversation

@avkirilishin
Copy link
Contributor

closes: #16472

A robust solution requires TI try history. I'm not sure it's worth it. So this PR is a workaround: we try to find an appropriate host in all task instances. Some hosts can disappear from TI-table, but I think it's a rare case.

Solution tested at the 3-host cluster.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@avkirilishin avkirilishin force-pushed the fetch_log_files_from_multiple_hosts branch 2 times, most recently from 768aa8a to 7cbda7b Compare February 2, 2022 21:14
@avkirilishin avkirilishin requested a review from uranusjr February 2, 2022 21:22
@potiuk
Copy link
Member

potiuk commented Feb 15, 2022

You need to rebase

@avkirilishin avkirilishin force-pushed the fetch_log_files_from_multiple_hosts branch from 7cbda7b to dc8dcaa Compare February 23, 2022 17:56
@avkirilishin
Copy link
Contributor Author

You need to rebase

@potiuk Done

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very wrong solution.

Yeah. It will work fine for 3 hosts and small number of task instances, but what if you have 50 hostnames and millions of task instances?

The solution is extremely "brute-force" solution to a problem. The query below has terrible characteristics: Basically it performs full table scan over one of the potentially HUGE TaskInstance table in order to retrieve unique hostnames used during the last two days (!). Without any index on hostname. This will kill the database and will run for minutes on a huge deployment.

                hosts = (
                    session.query(TaskInstance.hostname)
                    .filter(
                        TaskInstance.hostname != '',
                        TaskInstance.end_date > (timezone.utcnow() - LOG_SEARCH_INTERVAL),
                    )
                    .distinct()
                )

And it seems that "proper" fixing it - i.e. storing a list of hosts that diiferent retries of the same task instance used - should be ralatively easy.

Simply make the "hostname" field to accept both "hostname" and "hostname array" (coma separated) for example) this shoudl be rather easy change, you can also easily make a round-robin if the total array lenght will be too long (1000 characters is the current limit) or we could even increase the limit of the hostname field if we are worried about it. It is not part of any index nor searched for so it can be even unlimited.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Apr 14, 2022
@github-actions github-actions bot closed this Apr 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:logging stale Stale PRs per the .github/workflows/stale.yml policy file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Failed to fetch log file from worker because webserver did not search the right worker.

3 participants