
How can I Improve on performance #840

Closed
zeedevio opened this issue Feb 7, 2024 · 10 comments
Assignees
Labels
area:performance Related to performance, like memory usage, CPU usage, speed, etc dbt:run Primarily related to dbt run command or functionality epic-assigned execution:local Related to Local execution environment
Milestone

Comments


zeedevio commented Feb 7, 2024

I'm using Astro (Airflow) locally.

I'm trying to execute dbt transformation models through Airflow and find it extremely slow, so I'm looking for help improving the performance of my Airflow DAG.

I'm running 56 transformation models and have done the following tests:

  • dbt Cloud: it completes my model transformations in 4 minutes

  • Testing the profile locally: using `dbt run --profiles-dir /usr/local/airflow/include/dbt/`, it completes my transformations in 6 minutes
    [Screenshot 2024-02-07 at 11 40 57]

  • Airflow: it takes 44 minutes and the run fails

[Screenshot 2024-02-07 at 11 46 36]

I'm looking for a way to run my models much faster.


dosubot bot commented Feb 7, 2024

🤖

Hey @zeedevio! Great to see you back here, bringing more interesting challenges to the table. Hope we can tackle this one together too! 🚀

Based on the information you've provided, there are several potential areas to explore to improve the performance of your Airflow DAG when executing DBT transformation models.

Firstly, it appears that you're installing dependencies every time the DAG runs. This can significantly slow down the execution time. If your dependencies don't change frequently, you might want to consider installing them once and reusing the environment. You can set install_deps to False after the first run. This is demonstrated in the cosmos_manifest_example.py file in the repository.

Secondly, DBT supports parallel model execution, which can significantly speed up the transformation process. However, this feature might not be fully utilized when running DBT transformations within Airflow, depending on your Airflow configuration. You might want to check your Airflow configuration to ensure it allows for parallel task execution.
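On the dbt side, the degree of parallelism is controlled by the `threads` setting in `profiles.yml`. A minimal sketch, where the profile name, adapter type, and connection fields are placeholders that depend on your project and warehouse:

```yaml
# profiles.yml -- names and adapter are illustrative placeholders
my_project:
  target: dev
  outputs:
    dev:
      type: bigquery   # assumption: your adapter type depends on your warehouse
      threads: 8       # number of models dbt may run concurrently
      # ...adapter-specific connection fields...
```

Note that dbt threads and Airflow task concurrency are independent knobs; raising one without the other can leave the pipeline bottlenecked.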

Thirdly, the performance can also be influenced by the resources allocated to your Airflow and DBT tasks. Ensure that your tasks have enough memory and CPU resources to run efficiently.

Fourthly, the performance of DBT transformations can also be influenced by the volume and complexity of the data being processed. If you're dealing with large volumes of data or complex transformations, consider optimizing your DBT models or breaking down the transformations into smaller, more manageable tasks.

Lastly, if your Airflow instance and the database are not in the same network, there might be network latency issues. Consider reducing network latency by bringing them closer, for example, in the same VPC or using a direct connection.

Please note that these are general suggestions. The exact solution might vary depending on your specific setup and requirements. If you could provide more information about your current Airflow configuration, particularly in terms of parallel task execution, it would be helpful in providing more specific advice.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


@dosubot dosubot bot added area:performance Related to performance, like memory usage, CPU usage, speed, etc dbt:run Primarily related to dbt run command or functionality execution:local Related to Local execution environment labels Feb 7, 2024
@kzajaczkowski

Having very similar performance issue.

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label May 16, 2024

dosubot bot commented May 16, 2024

Hi, @zeedevio,

I'm helping the Cosmos team manage our backlog and am marking this issue as stale. From what I understand, you opened this issue to address the slow performance of running DBT transformation models using Astro Airflow locally. In response, I provided several potential areas to explore for improving the performance, and kzajaczkowski commented that they are experiencing a similar performance issue.

Could you please confirm if this issue is still relevant to the latest version of the Cosmos repository? If it is, please let the Cosmos team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you!


tatiana commented May 17, 2024

Hi, @zeedevio. This is something we're currently working on and trying to improve. There have been significant improvements in Cosmos 1.4.

Could you please confirm how the performance looks for you after upgrading?

@tatiana tatiana added this to the 1.5.0 milestone May 17, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label May 17, 2024
@tatiana tatiana added triage-needed Items need to be reviewed / assigned to milestone and removed triage-needed Items need to be reviewed / assigned to milestone labels May 17, 2024
@tatiana tatiana self-assigned this May 17, 2024

tatiana commented Jun 5, 2024

Some progress: #1014.


zeedevio commented Jun 5, 2024

Hey, good day,
Sorry for the late reply, but I was able to speed up my transformation workflow models.

In my case I used GCP Cloud Composer. What I did to speed up the transformations was to update the Airflow configuration:

| Configuration | New value |
| --- | --- |
| dagbag_import_timeout | 5000 |
| parallelism | 32 |
| max_active_tasks_per_dag | 16 |
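For reference, the same values can be set directly in `airflow.cfg` (or as Airflow configuration overrides in Composer). The section placement below follows Airflow 2.x defaults:

```ini
[core]
# Raised from the default so large dbt projects can finish parsing
dagbag_import_timeout = 5000
# Max task instances running concurrently across the whole installation
parallelism = 32
# Max task instances running concurrently within a single DAG
max_active_tasks_per_dag = 16
```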


tatiana commented Jun 17, 2024

PR #1014 significantly improves performance with LoadMode.DBT_LS.


tatiana commented Jun 26, 2024

If you are using LoadMode.DBT_LS, could you please try Cosmos 1.5.0a9, which will be released as a stable version this week?

Some ways to improve the performance using Cosmos 1.4:

1. Can you pre-compile your dbt project?

If yes, this removes that responsibility from the Airflow DAG processor, greatly reducing DAG parsing time. You can try this by using LoadMode.DBT_MANIFEST and specifying the path to the manifest file:

DbtDag(
    ...,
    # Note: the manifest path is passed via ProjectConfig;
    # the path below is a placeholder.
    project_config=ProjectConfig(
        manifest_path="/usr/local/airflow/dbt/target/manifest.json",
    ),
    render_config=RenderConfig(
        load_method=LoadMode.DBT_MANIFEST
    )
)

More information: https://astronomer.github.io/astronomer-cosmos/configuration/parsing-methods.html#dbt-manifest
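For context on why this is fast: a pre-compiled `manifest.json` (produced by `dbt compile`, `dbt parse`, or a full run) already lists every node in the project, so the DAG processor only reads a JSON file instead of shelling out to dbt. A minimal stdlib sketch of that read; the manifest fragment below is a trimmed illustration, not a full dbt manifest:

```python
import json
from pathlib import Path

def list_models(manifest_path: str) -> list[str]:
    """Return the names of all model nodes in a dbt manifest.json."""
    manifest = json.loads(Path(manifest_path).read_text())
    return [
        node["name"]
        for node in manifest.get("nodes", {}).values()
        if node.get("resource_type") == "model"
    ]

# Tiny hand-written manifest fragment, for illustration only.
sample = {
    "nodes": {
        "model.jaffle_shop.stg_orders": {"name": "stg_orders", "resource_type": "model"},
        "test.jaffle_shop.not_null": {"name": "not_null", "resource_type": "test"},
    }
}
Path("manifest.json").write_text(json.dumps(sample))
print(list_models("manifest.json"))  # ['stg_orders']
```

Reading and filtering a JSON file like this takes milliseconds even for large projects, whereas `dbt ls` has to load and parse the whole project each time.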

2. If you need to use LoadMode.DBT_LS, can you pre-install dbt dependencies in the Airflow scheduler and worker nodes?

If yes, this will avoid Cosmos having to run dbt deps all the time before running any dbt command, both in the scheduler and worker nodes. In that case, you should set:

DbtDag(
    ...,
    operator_args={"install_deps": False},
    render_config=RenderConfig(
        dbt_deps=False
    )
)

More info:
https://astronomer.github.io/astronomer-cosmos/configuration/render-config.html

3. If you need to use LoadMode.DBT_LS, is your dbt project large? Could you use selectors to select a subset?

jaffle_shop = DbtDag(
    render_config=RenderConfig(
        select=["path:analytics"],
    )
)

More info: https://astronomer.github.io/astronomer-cosmos/configuration/selecting-excluding.html

@tatiana tatiana closed this as completed Jun 26, 2024
@kzajaczkowski

@tatiana, thank you for the detailed explanations; they're helpful. Is using the manifest the recommended mode in terms of performance? If so, will that still be the case after 1.5.0 is released?


tatiana commented Jun 26, 2024

@kzajaczkowski, it really depends on your team's needs!

Using the manifest is the safest approach in production, in the sense that you fully off-load the Airflow DAG processor from ever having to run the dbt ls command. As you know, dbt ls can take some time to run, proportional to the size of the project.

We understand that, for many teams, it is handy not to have to pre-compile the dbt project. If it is acceptable that the Airflow DAG processor will occasionally run dbt ls and that the cache will be stored in Airflow Variables, then the 1.5 release delivers the best performance we have managed to achieve with LoadMode.DBT_LS.

In the past weeks, we've collaborated with customers who tested every iteration of the 1.5 alphas. They will deploy them live once we release the production version, giving us confidence in the stability of what we've built.
