
[CT-1760] [Bug] Python resource_tracker causing dbt test to fail #6535

Closed
2 tasks done
alexrosenfeld10 opened this issue Jan 6, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@alexrosenfeld10
Contributor

alexrosenfeld10 commented Jan 6, 2023

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

We're seeing sporadic errors causing our production Argo Workflow runs to fail. The issue is neither consistent nor easily reproducible, but when it happens the dbt test command fails and the pod is terminated with code 143. This is obviously not a good thing in a production environment, and it causes alerts that go to on-call engineers.

For full clarity, I will post the various layers that ultimately invoke dbt test.

The runs execute the following Argo Workflow template:

          - name: test-prod
            when: "'{{workflow.parameters.snowflake-env}}' =~ '^prod'"
            inline:
              container:
                image: my-image
                command: ["./execute_domain.sh"]
                args: [test, --select, published]

The execute_domain.sh script looks like this (slightly abridged, but this is where the rubber meets the road):

echo "{ \"level\": \"info\", \"message\": \"running command 'dbt ${DBT_ARGS}'\"}"
dbt --no-use-colors --profiles-dir . --log-format json "$@"

The error that occurs during the run is:

/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
time="2023-01-05T19:25:53.702Z" level=warning msg="Non-transient error: context canceled"

Here is a screenshot detailing the error happening during the test run, and not during any other time:

[screenshot: the resource_tracker warning appears only during the test step]

Note: I realize the logs indicate that the resource_tracker issue is a warning-level log, implying it's not the root cause of the failure... However, there are no other relevant logs output by dbt during these failures, so this is all I have to go on.

Expected Behavior

dbt test should run end to end without crashing.

Steps To Reproduce

This is challenging to reproduce. Let's get a discussion going, and from there if no one on the dbt Labs side of things has ideas, we can investigate ways to reproduce this in a shareable fashion.

Relevant log output

/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
time="2023-01-05T19:25:53.702Z" level=warning msg="Non-transient error: context canceled"

Environment

  • OS:
# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • Python:
# python --version
Python 3.10.3
  • dbt:
# dbt --version
Core:
  - installed: 1.1.0
  - latest:    1.3.2 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - snowflake: 1.1.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Which database adapter are you using with dbt?

No response

Additional Context

No response

@alexrosenfeld10 alexrosenfeld10 added bug Something isn't working triage labels Jan 6, 2023
@github-actions github-actions bot changed the title [Bug] Python resource_tracker causing dbt test to fail [CT-1760] [Bug] Python resource_tracker causing dbt test to fail Jan 6, 2023
@jtcohen6 jtcohen6 self-assigned this Jan 9, 2023
@jtcohen6
Contributor

jtcohen6 commented Jan 9, 2023

@alexrosenfeld10 Thanks for opening!

After a bit of googling, it seems like this will be tricky to reproduce. When a Python application is running multiple processes (as dbt does to handle threaded execution), this warning crops up when the application is terminated partway through, e.g. if it hits a memory limit. For example, here's a rough sketch (hypothetical, not dbt's actual code) of how the warning can be produced:
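# Hypothetical reproduction, not dbt's actual code: on Python 3.8-3.10,
# a process holding multiprocessing semaphores that is killed before it
# can clean up triggers the same resource_tracker warning at shutdown.
import multiprocessing
import os
import signal
import time

def work(_):
    time.sleep(10)  # long enough to still be running when the kill arrives

if __name__ == "__main__":
    # A Pool allocates semaphores internally, much as dbt's runner does.
    pool = multiprocessing.Pool(2)
    pool.map_async(work, range(4))
    time.sleep(1)
    # Simulate the SIGTERM that k8s/Argo sends (exit status 143 = 128 + 15).
    os.kill(os.getpid(), signal.SIGTERM)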

Is there any chance your Argo Workflow is running out of memory? I'd hope that there'd be other error messages to say something to that effect, and explain why the process is being killed...
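
If a failing pod were still around, something like this could confirm or rule out an OOM kill (the pod name here is hypothetical):

kubectl get pod my-dbt-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints "OOMKilled" if the kernel killed the container for exceeding its memory limit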

But it also sounds like this could be related to a bug in Python 3.8+, stemming from a switch within the multiprocessing library to use spawn instead of fork (!). Check out the bug here: python/cpython#90549. It looks like a fix for it was merged last summer, and included in Python 3.11?

@jtcohen6 jtcohen6 removed their assignment Jan 9, 2023
@alexrosenfeld10
Contributor Author

alexrosenfeld10 commented Jan 9, 2023

@jtcohen6 thanks for the reply! I wonder if the best course of action here is to simply wait for dbt-core to support Python 3.11? Do you know when that might be?

I'm unable to check the memory history on those pods (they've been churned and our monitoring software no longer has the data). I can check if / when this happens again. However, I'd be pretty surprised if they ran out of memory, as all they're responsible for is dbt test, which I believe is a relatively lightweight process? Most of the heavy lifting there happens in the data warehouse, I'd expect.

I didn't see any other error messages related to pods being OOMKilled, which I would've expected to see as well if that was the issue.

@jtcohen6
Contributor

jtcohen6 commented Jan 9, 2023

I wonder if the best course of action here is to simply wait for dbt-core to support python 3.11? Do you know when that might be?

dbt-core will support Py311 in v1.4 (next!), currently in beta, release candidate coming this week. Support on dbt-snowflake will take a little longer because we're waiting on snowflake-connector-python: dbt-labs/dbt-snowflake#331

@jtcohen6
Contributor

jtcohen6 commented Jan 9, 2023

However, I'd be pretty surprised if they ran out of memory, as all they're responsible for is dbt test, which I believe is a relatively lightweight process?

Agreed, I wouldn't have expected that either. It would make more sense if it were a more memory-intensive command, e.g. dbt docs generate.

I missed this on the first read, but I do see a message Error: exit status 143, which means the container running dbt received SIGTERM (exit status 143 = 128 + 15, and 15 is SIGTERM's signal number). Is it possible that Argo/k8s are killing the process due to resource limitations, a timeout, a new workflow being triggered, ... any other imaginable reason?

My hypothesis remains that, while the leaked semaphore warning is pretty gross, it's actually a red herring. The process is being terminated for some reason, and a bug in Python's multiprocessing module in 3.8-3.10 causes that warning to appear whenever a multi-threaded process is interrupted.
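
One way to gather more evidence next time: trap SIGTERM in the wrapper script so there's an explicit log line whenever the container is told to shut down. A rough sketch, adapting the execute_domain.sh from above:

#!/bin/bash
on_sigterm() {
  echo '{ "level": "error", "message": "received SIGTERM; dbt was terminated externally" }'
  kill -TERM "$dbt_pid" 2>/dev/null  # forward the signal to dbt
  exit 143  # 128 + 15 (SIGTERM), matching the observed exit status
}
trap on_sigterm TERM

# Run dbt in the background and wait, so the trap fires promptly;
# bash defers trap handling while a foreground child is running.
dbt --no-use-colors --profiles-dir . --log-format json "$@" &
dbt_pid=$!
wait "$dbt_pid"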

@alexrosenfeld10
Contributor Author

dbt-core will support Py311 in v1.4 (next!), currently in beta, release candidate coming this week. Support on dbt-snowflake will take a little longer because we're waiting on snowflake-connector-python: dbt-labs/dbt-snowflake#331

Glad dbt has their sights set on the new version! Thanks for the info. I've had to fork the Snowflake connectors and DIY stuff because their development is fairly slow (that said, I'm using the .NET connector, which I understand isn't their main market). Hopefully they come through with Python 3.11 support soonish!

It's possible it's being interrupted somewhere somehow, but I don't see any other indicators of that in our system. We'll look for more details if it pops up again. I'll close this out, as the next actionable things here are to upgrade Python and collect more data. Thanks again for the info!
