performance degradation #932

Closed
liranc1 opened this issue Apr 30, 2024 · 7 comments
Labels: area:config, area:performance, epic-assigned, profile:snowflake

liranc1 commented Apr 30, 2024

Before using Cosmos, the Airflow DAG ran for about 15 minutes for a certain dbt command.
After the change to Cosmos, the same dbt command is much more volatile, often taking 20-30 minutes.
All of the DAG's Airflow resources stayed the same, and there was no change in the dbt connection details.

I also encountered some tasks randomly failing with a Snowflake connection error and then succeeding on the next run. This issue did not occur without Cosmos.

Cosmos configurations used:

ExecutionConfig(dbt_executable_path=DBT_EXECUTABLE_PATH)

RenderConfig(
    select=["models"],
    test_behavior=TestBehavior.NONE,
    load_method=LoadMode.DBT_LS,
    dbt_deps=False
)

ProjectConfig(os.environ["DBT_PROJECT_PATH"], dbt_vars=dbt_vars)

dbt version:
Core:
  • installed: 1.5.6
Plugins:
  • snowflake: 1.5.6
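
For context, a sketch of how these configs are typically wired into a DbtDag (the DAG id, profile handling, and placeholder values below are assumptions, not part of the report):

import os

from cosmos import DbtDag, ExecutionConfig, ProjectConfig, RenderConfig
from cosmos.constants import LoadMode, TestBehavior

DBT_EXECUTABLE_PATH = "/usr/local/airflow/dbt_venv/bin/dbt"  # placeholder value
dbt_vars = {}  # placeholder; the actual dbt vars were not shared

dag = DbtDag(
    dag_id="dbt_models",  # assumed name
    project_config=ProjectConfig(os.environ["DBT_PROJECT_PATH"], dbt_vars=dbt_vars),
    render_config=RenderConfig(
        select=["models"],
        test_behavior=TestBehavior.NONE,
        load_method=LoadMode.DBT_LS,
        dbt_deps=False,
    ),
    execution_config=ExecutionConfig(dbt_executable_path=DBT_EXECUTABLE_PATH),
    # profile_config (Snowflake) omitted; it was not included in the report
)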

dosubot bot commented Apr 30, 2024

To address the performance degradation and intermittent connection errors with your Airflow DAGs using cosmos for dbt tasks, consider the following targeted steps:

  1. Optimize Dependency Installation: If dependencies are installed on each run, this can slow down execution. After the initial run, set install_deps to False to reuse the environment, as shown in cosmos_manifest_example.py (see the sketch below).

  2. Parallel Model Execution: Ensure your Airflow configuration supports parallel task execution to fully utilize dbt's parallel model execution capabilities.

  3. Resource Allocation: Check that your Airflow and dbt tasks have sufficient memory and CPU resources.

  4. Data Volume and Complexity: For large or complex data transformations, optimize your dbt models or break down tasks into smaller segments.

  5. Network Latency: If Airflow and the database are not in the same network, reduce network latency by optimizing your network configuration.

For the Snowflake connection issues:

  • Ensure Correct Configuration: Verify that your Airflow Snowflake connection is correctly configured, especially for key-pair authentication using environment variables for sensitive information like private_key and private_key_passphrase.
  • Review Airflow and dbt Logs: Detailed error messages in these logs can provide insights into the specific issues.
  • Network and Snowflake Warehouse Performance: Assess network setup and Snowflake warehouse configuration to ensure they are optimized for your workload.
  • Update Dependencies: Ensure all related dependencies are up to date to benefit from performance improvements and bug fixes.

These steps are derived from analyzing the provided configuration and issues. Implementing these recommendations should help mitigate the performance and connectivity issues you're experiencing.
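
A sketch of suggestions 1 and 2 above as DbtDag arguments (the values are illustrative assumptions showing where these knobs live, not recommended settings):

from cosmos import DbtDag

dag = DbtDag(
    # ... project/profile/render/execution configs as in the original DAG ...
    operator_args={"install_deps": False},  # suggestion 1: reuse pre-installed dbt packages
    max_active_tasks=16,  # suggestion 2: standard Airflow DAG argument capping this DAG's parallel tasks
)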


dosubot bot added the area:config, area:performance, and profile:snowflake labels Apr 30, 2024
tatiana (Collaborator) commented Apr 30, 2024

Hi @liranc1, thanks for sharing this information. Could you please confirm which version of Cosmos you were using before and after this performance degradation was observed?

liranc1 (Author) commented Apr 30, 2024

Cosmos was not used before; I was using the Bash operator.
The performance degradation started once I moved to Cosmos.
The Cosmos version I used is 1.3.2.
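
For reference, a pre-Cosmos baseline like the one described above typically looks roughly like this (a sketch only; the actual task id and command were not shared):

from airflow.operators.bash import BashOperator

# Hypothetical baseline: the entire dbt invocation runs as a single Airflow task,
# so there is no per-model scheduling and no dbt ls parsing at DAG-processing time.
dbt_run = BashOperator(
    task_id="dbt_run",  # assumed name
    bash_command="dbt run --project-dir $DBT_PROJECT_PATH",  # assumed command
)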

tatiana (Collaborator) commented May 14, 2024

@liranc1 could you try out Cosmos 1.4 and let us know if there are any performance improvements?

tatiana added this to the 1.5.0 milestone May 17, 2024
tatiana added and then removed the triage-needed label May 17, 2024
tatiana self-assigned this May 17, 2024
tatiana (Collaborator) commented Jun 5, 2024

Some progress: #1014.

tatiana (Collaborator) commented Jun 17, 2024

The previously mentioned PR, #1014, is up for review and seems to have promising results.

tatiana (Collaborator) commented Jun 26, 2024

If you are using LoadMode.DBT_LS, could you please try Cosmos 1.5.0a9, which will be released as a stable version this week?

Some ways to improve the performance using Cosmos 1.4:

1. Can you pre-compile your dbt project?

If yes, this removes that responsibility from the Airflow DAG processor, greatly reducing DAG parsing time. You can try this by specifying the path to the manifest file and using:

DbtDag(
    ...,
    render_config=RenderConfig(
        load_method=LoadMode.DBT_MANIFEST
    )
)

More information: https://astronomer.github.io/astronomer-cosmos/configuration/parsing-methods.html#dbt-manifest
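
A sketch of where the pre-compiled manifest is supplied (the path and project name below are placeholders); per the parsing-methods docs linked above, the manifest location goes into ProjectConfig:

from cosmos import DbtDag, ProjectConfig, RenderConfig
from cosmos.constants import LoadMode

dag = DbtDag(
    ...,
    project_config=ProjectConfig(
        manifest_path="/path/to/target/manifest.json",  # placeholder: manifest produced by dbt compile, e.g. in CI
        project_name="my_dbt_project",  # placeholder
    ),
    render_config=RenderConfig(
        load_method=LoadMode.DBT_MANIFEST,
    ),
)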

2. If you need to use LoadMode.DBT_LS, can you pre-install dbt dependencies in the Airflow scheduler and worker nodes?

If yes, this avoids Cosmos having to run dbt deps before every dbt command, both in the scheduler and worker nodes. In that case, you should set:

DbtDag(
    ...,
    operator_args={"install_deps": False},
    render_config=RenderConfig(
        dbt_deps=False
    )
)

More info:
https://astronomer.github.io/astronomer-cosmos/configuration/render-config.html

3. If you need to use LoadMode.DBT_LS, is your dbt project large? Could you use selectors to select a subset?

jaffle_shop = DbtDag(
    ...,
    render_config=RenderConfig(
        select=["path:analytics"],
    )
)

More info: https://astronomer.github.io/astronomer-cosmos/configuration/selecting-excluding.html

4. Are you able to install dbt in the same Python virtual environment as you have Airflow installed?

If this is a possibility, you'll see significant performance improvements by leveraging the InvocationMode.DBT_RUNNER method, which is enabled by default since Cosmos 1.4 (sketched below).

More information:
https://astronomer.github.io/astronomer-cosmos/getting_started/execution-modes.html#invocation-modes
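
A sketch of setting the invocation mode explicitly (assumes Cosmos >= 1.4 with dbt installed in the same virtual environment as Airflow; since the comment above notes it is the default there, this is only needed if it was overridden):

from cosmos import DbtDag, ExecutionConfig
from cosmos.constants import InvocationMode

dag = DbtDag(
    ...,
    execution_config=ExecutionConfig(
        invocation_mode=InvocationMode.DBT_RUNNER,  # invoke dbt in-process instead of spawning a dbt subprocess
    ),
)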

tatiana closed this as completed Jun 26, 2024